Abstract

In recent years, spectral clustering has become one of the most popular modern clustering 
algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, 
and very often outperforms traditional clustering algorithms such as the k-means algorithm. At
first glance spectral clustering appears slightly mysterious, and it is not obvious why
it works at all and what it really does. The goal of this tutorial is to give some intuition on
those questions. We describe different graph Laplacians and their basic properties, present the
most common spectral clustering algorithms, and derive those algorithms from scratch by several 
different approaches. Advantages and disadvantages of the different spectral clustering algorithms 
are discussed. 
1 Introduction

Clustering is one of the most widely used techniques for exploratory data analysis, with applications
ranging from statistics, computer science, and biology to social sciences and psychology. In virtually every
scientific field dealing with empirical data, people attempt to get a first impression of their data by
trying to identify groups of "similar behavior" in their data. In this article we would like to introduce
the reader to the family of spectral clustering algorithms. Compared to the "traditional algorithms"
such as k-means or single linkage, spectral clustering has many fundamental advantages. Results
obtained by spectral clustering often outperform the traditional approaches; moreover, spectral clustering
is very simple to implement and can be solved efficiently by standard linear algebra methods.

This tutorial is set up as a self-contained introduction to spectral clustering. We derive spectral
clustering from scratch and present different points of view on why spectral clustering works. Apart
from basic linear algebra, no particular mathematical background is required of the reader. However,
we do not attempt to give a concise review of the whole literature on spectral clustering, which is
impossible due to the overwhelming amount of literature on this subject. The first two sections
are devoted to a step-by-step introduction to the mathematical objects used by spectral clustering:
similarity graphs in Section 2, and graph Laplacians in Section 3. The spectral clustering algorithms
themselves will be presented in Section 4. The next three sections are then devoted to explaining
why those algorithms work. Each section corresponds to one explanation: Section 5 describes a graph
partitioning approach, Section 6 a random walk perspective, and Section 7 a perturbation theory
approach. In Section 8 we will study some practical issues related to spectral clustering, and discuss
various extensions and literature related to spectral clustering in Section 9.



2 Similarity graphs 



Given a set of data points $x_1, \ldots, x_n$ and some notion of similarity $s_{ij} \geq 0$ between all pairs of data
points $x_i$ and $x_j$, the intuitive goal of clustering is to divide the data points into several groups such
that points in the same group are similar and points in different groups are dissimilar to each other. If
we do not have more information than similarities between data points, a nice way of representing the
data is in form of the similarity graph $G = (V, E)$. Each vertex $v_i$ in this graph represents a data point
$x_i$. Two vertices are connected if the similarity $s_{ij}$ between the corresponding data points $x_i$ and $x_j$ is
positive or larger than a certain threshold, and the edge is weighted by $s_{ij}$. The problem of clustering
can now be reformulated using the similarity graph: we want to find a partition of the graph such
that the edges between different groups have very low weights (which means that points in different
clusters are dissimilar from each other) and the edges within a group have high weights (which means
that points within the same cluster are similar to each other). To be able to formalize this intuition we
first want to introduce some basic graph notation and briefly discuss the kind of graphs we are going
to study.

2.1 Graph notation 

Let $G = (V, E)$ be an undirected graph with vertex set $V = \{v_1, \ldots, v_n\}$. In the following we assume
that the graph $G$ is weighted, that is each edge between two vertices $v_i$ and $v_j$ carries a non-negative
weight $w_{ij} \geq 0$. The weighted adjacency matrix of the graph is the matrix $W = (w_{ij})_{i,j=1,\ldots,n}$. If
$w_{ij} = 0$ this means that the vertices $v_i$ and $v_j$ are not connected by an edge. As $G$ is undirected we
require $w_{ij} = w_{ji}$. The degree of a vertex $v_i \in V$ is defined as

$$d_i = \sum_{j=1}^{n} w_{ij}.$$

Note that, in fact, this sum only runs over all vertices adjacent to $v_i$, as for all other vertices $v_j$ the
weight $w_{ij}$ is 0. The degree matrix $D$ is defined as the diagonal matrix with the degrees $d_1, \ldots, d_n$
on the diagonal. Given a subset of vertices $A \subset V$, we denote its complement $V \setminus A$ by $\bar{A}$. We
define the indicator vector $\mathbf{1}_A = (f_1, \ldots, f_n)' \in \mathbb{R}^n$ as the vector with entries $f_i = 1$ if $v_i \in A$ and
$f_i = 0$ otherwise. For convenience we introduce the shorthand notation $i \in A$ for the set of indices
$\{i \mid v_i \in A\}$, in particular when dealing with a sum like $\sum_{i \in A} w_{ij}$. For two not necessarily disjoint sets
$A, B \subset V$ we define

$$W(A, B) := \sum_{i \in A,\, j \in B} w_{ij}.$$

We consider two different ways of measuring the "size" of a subset $A \subset V$:

$$|A| := \text{the number of vertices in } A,$$
$$\mathrm{vol}(A) := \sum_{i \in A} d_i.$$

Intuitively, $|A|$ measures the size of $A$ by its number of vertices, while $\mathrm{vol}(A)$ measures the size of $A$
by summing over the weights of all edges attached to vertices in $A$. A subset $A \subset V$ of a graph is
connected if any two vertices in $A$ can be joined by a path such that all intermediate points also lie
in $A$. A subset $A$ is called a connected component if it is connected and if there are no connections
between vertices in $A$ and $\bar{A}$. The nonempty sets $A_1, \ldots, A_k$ form a partition of the graph if
$A_i \cap A_j = \emptyset$ for $i \neq j$ and $A_1 \cup \ldots \cup A_k = V$.
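The notation of this section maps directly onto array operations. The following is a minimal sketch, assuming NumPy and a small hypothetical weight matrix whose entries are made up for illustration; the helper names `weight_between` and `vol` are not from the text.

```python
import numpy as np

# Hypothetical 4-vertex weighted graph; the weights are made up for illustration.
W = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 3.0, 0.0],
])
assert np.allclose(W, W.T)   # undirected graph: w_ij = w_ji

d = W.sum(axis=1)            # degrees d_i = sum_j w_ij
D = np.diag(d)               # degree matrix


def weight_between(W, A, B):
    """W(A, B): sum of w_ij over i in A and j in B (A, B are index lists)."""
    return W[np.ix_(A, B)].sum()


def vol(W, A):
    """vol(A): sum of the degrees of the vertices in A."""
    return W.sum(axis=1)[A].sum()


A = [0, 1]
A_bar = [i for i in range(len(W)) if i not in A]   # complement of A

print("degrees:", d)
print("|A| =", len(A))
print("vol(A) =", vol(W, A))
print("W(A, A_bar) =", weight_between(W, A, A_bar))
```

Note that `len(A)` counts vertices while `vol(W, A)` counts edge weight, so the two size measures can rank subsets differently.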
2.2 Different similarity graphs



There are several popular constructions to transform a given set $x_1, \ldots, x_n$ of data points with pairwise
similarities $s_{ij}$ or pairwise distances $d_{ij}$ into a graph. When constructing similarity graphs the goal is
to model the local neighborhood relationships between the data points.

The $\varepsilon$-neighborhood graph: Here we connect all points whose pairwise distances are smaller than $\varepsilon$.
As the distances between all connected points are roughly of the same scale (at most $\varepsilon$), weighting the
edges would not incorporate more information about the data to the graph. Hence, the $\varepsilon$-neighborhood
graph is usually considered as an unweighted graph.

$k$-nearest neighbor graphs: Here the goal is to connect vertex $v_i$ with vertex $v_j$ if $v_j$ is among
the $k$ nearest neighbors of $v_i$. However, this definition leads to a directed graph, as the neighborhood
relationship is not symmetric. There are two ways of making this graph undirected. The first way is
to simply ignore the directions of the edges, that is we connect $v_i$ and $v_j$ with an undirected edge if $v_i$
is among the $k$ nearest neighbors of $v_j$ or if $v_j$ is among the $k$ nearest neighbors of $v_i$. The resulting
graph is what is usually called the $k$-nearest neighbor graph. The second choice is to connect vertices
$v_i$ and $v_j$ if both $v_i$ is among the $k$ nearest neighbors of $v_j$ and $v_j$ is among the $k$ nearest neighbors of
$v_i$. The resulting graph is called the mutual $k$-nearest neighbor graph. In both cases, after connecting
the appropriate vertices we weight the edges by the similarity of their endpoints.

The fully connected graph: Here we simply connect all points with positive similarity with each
other, and we weight all edges by $s_{ij}$. As the graph should represent the local neighborhood
relationships, this construction is only useful if the similarity function itself models local
neighborhoods. An example for such a similarity function is the Gaussian similarity function
$s(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, where the parameter $\sigma$ controls the width of the neighborhoods.
This parameter plays a similar role as the parameter $\varepsilon$ in case of the $\varepsilon$-neighborhood graph.

All graphs mentioned above are regularly used in spectral clustering. To our knowledge, theoretical
results on the question of how the choice of the similarity graph influences the spectral clustering result
do not exist. For a discussion of the behavior of the different graphs we refer to Section 8.
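The three constructions can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy, Euclidean distances, and the Gaussian similarity for edge weights; the function names are hypothetical, and removing self-edges is a choice of this sketch rather than something the text prescribes.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)


def gaussian_similarity(X, sigma=1.0):
    """s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))


def epsilon_graph(X, eps):
    """Unweighted eps-neighborhood graph without self-edges."""
    W = (np.sqrt(pairwise_sq_dists(X)) < eps).astype(float)
    np.fill_diagonal(W, 0.0)
    return W


def knn_graph(X, k, sigma=1.0, mutual=False):
    """(Mutual) k-nearest neighbor graph, weighted by Gaussian similarity."""
    dist = pairwise_sq_dists(X)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    directed = np.zeros_like(dist, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    directed[rows, nearest.ravel()] = True    # i -> j: j is among i's k nearest
    keep = (directed & directed.T) if mutual else (directed | directed.T)
    return np.where(keep, gaussian_similarity(X, sigma), 0.0)


def fully_connected_graph(X, sigma=1.0):
    """All pairs connected, weighted by Gaussian similarity, no self-edges."""
    W = gaussian_similarity(X, sigma)
    np.fill_diagonal(W, 0.0)
    return W


# Hypothetical one-dimensional sample.
X = np.array([[0.0], [1.0], [3.0], [10.0]])
W_eps = epsilon_graph(X, eps=2.5)
W_knn = knn_graph(X, k=1)
W_mutual = knn_graph(X, k=1, mutual=True)
W_full = fully_connected_graph(X)
```

On this sample the mutual graph keeps only the edge between the first two points, whereas the ordinary 1-nearest neighbor graph forms a chain: the mutual variant is the more conservative of the two.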



3 Graph Laplacians and their basic properties 

The main tools for spectral clustering are graph Laplacian matrices. There exists a whole field
dedicated to the study of those matrices, called spectral graph theory (e.g., see Chung, 1997). In this
section we want to define different graph Laplacians and point out their most important properties.
We will carefully distinguish between different variants of graph Laplacians. Note that in the literature
there is no unique convention which matrix exactly is called "graph Laplacian". Usually, every author
just calls "his" matrix the graph Laplacian. Hence, a lot of care is needed when reading literature on
graph Laplacians.

In the following we always assume that $G$ is an undirected, weighted graph with weight matrix $W$,
where $w_{ij} = w_{ji} \geq 0$. When using eigenvectors of a matrix, we will not necessarily assume that they
are normalized. For example, the constant vector $\mathbf{1}$ and a multiple $a\mathbf{1}$ for some $a \neq 0$ will be considered
as the same eigenvectors. Eigenvalues will always be ordered increasingly, respecting multiplicities.
By "the first $k$ eigenvectors" we refer to the eigenvectors corresponding to the $k$ smallest eigenvalues.

3.1 The unnormalized graph Laplacian

The unnormalized graph Laplacian matrix is defined as

$$L = D - W.$$

An overview of many of its properties can be found in Mohar (1991, 1997). The following proposition
summarizes the most important facts needed for spectral clustering.

Proposition 1 (Properties of $L$) The matrix $L$ satisfies the following properties:

1. For every vector $f \in \mathbb{R}^n$ we have

$$f'Lf = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.$$

2. $L$ is symmetric and positive semi-definite.

3. The smallest eigenvalue of $L$ is 0, the corresponding eigenvector is the constant one vector $\mathbf{1}$.

4. $L$ has $n$ non-negative, real-valued eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$.

Proof.

Part (1): By the definition of $d_i$,

$$f'Lf = f'Df - f'Wf = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} f_i f_j w_{ij}$$
$$= \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} f_i f_j w_{ij} + \sum_{j=1}^{n} d_j f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.$$

Part (2): The symmetry of $L$ follows directly from the symmetry of $W$ and $D$. The positive
semi-definiteness is a direct consequence of Part (1), which shows that $f'Lf \geq 0$ for all $f \in \mathbb{R}^n$.

Part (3): Obvious.

Part (4) is a direct consequence of Parts (1) - (3). □
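The statements of Proposition 1 are easy to check numerically. The sketch below assumes NumPy and a randomly generated symmetric weight matrix (hypothetical test data, not from the text); it compares the quadratic form with the pairwise sum of Part (1) and inspects the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric non-negative weight matrix (hypothetical test data).
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))
L = D - W                           # unnormalized graph Laplacian

f = rng.standard_normal(n)          # arbitrary test vector

quad_form = f @ L @ f               # left-hand side of Part (1)
pairwise = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                     for i in range(n) for j in range(n))

eigvals = np.linalg.eigvalsh(L)     # real eigenvalues, ascending order

print("f'Lf          :", quad_form)
print("pairwise sum  :", pairwise)
print("smallest eigenvalue:", eigvals[0])
```

Up to rounding error the two printed quantities agree and the smallest eigenvalue is zero, in line with Parts (1), (3) and (4).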



Note that the unnormalized graph Laplacian does not depend on the diagonal elements of the
adjacency matrix $W$. Each adjacency matrix which coincides with $W$ on all off-diagonal positions leads
to the same unnormalized graph Laplacian $L$. In particular, self-edges in a graph do not change the
corresponding graph Laplacian.

The unnormalized graph Laplacian and its eigenvalues and eigenvectors can be used to describe many 
properties of graphs, see Mohar (1991, 1997). One example which will be important for spectral 
clustering is the following: 

Proposition 2 (Number of connected components and the spectrum of $L$) Let $G$ be an
undirected graph with non-negative weights. Then the multiplicity $k$ of the eigenvalue 0 of $L$ equals the
number of connected components $A_1, \ldots, A_k$ in the graph. The eigenspace of eigenvalue 0 is spanned
by the indicator vectors $\mathbf{1}_{A_1}, \ldots, \mathbf{1}_{A_k}$ of those components.

Proof. We start with the case $k = 1$, that is the graph is connected. Assume that $f$ is an eigenvector
with eigenvalue 0. Then we know that

$$0 = f'Lf = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.$$

As the weights $w_{ij}$ are non-negative, this sum can only vanish if all terms $w_{ij}(f_i - f_j)^2$ vanish. Thus,
if two vertices $v_i$ and $v_j$ are connected (i.e., $w_{ij} > 0$), then $f_i$ needs to equal $f_j$. With this argument
we can see that $f$ needs to be constant for all vertices which can be connected by a path in the graph.
Moreover, as all vertices of a connected component in an undirected graph can be connected by a
path, $f$ needs to be constant on the whole connected component. In a graph consisting of only one
connected component we thus only have the constant one vector $\mathbf{1}$ as eigenvector with eigenvalue 0,
which obviously is the indicator vector of the connected component.



Now consider the case of $k$ connected components. Without loss of generality we assume that the
vertices are ordered according to the connected components they belong to. In this case, the adjacency
matrix $W$ has a block diagonal form, and the same is true for the matrix $L$:

$$L = \begin{pmatrix} L_1 & & & \\ & L_2 & & \\ & & \ddots & \\ & & & L_k \end{pmatrix}$$

Note that each of the blocks $L_i$ is a proper graph Laplacian on its own, namely the Laplacian
corresponding to the subgraph of the $i$-th connected component. As it is the case for all block diagonal
matrices, we know that the spectrum of $L$ is given by the union of the spectra of $L_i$, and the
corresponding eigenvectors of $L$ are the eigenvectors of $L_i$, filled with 0 at the positions of the other blocks.
As each $L_i$ is a graph Laplacian of a connected graph, we know that every $L_i$ has eigenvalue 0 with
multiplicity 1, and the corresponding eigenvector is the constant one vector on the $i$-th connected
component. Thus, the matrix $L$ has as many eigenvalues 0 as there are connected components, and
the corresponding eigenvectors are the indicator vectors of the connected components. □
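Proposition 2 can be illustrated on a small disconnected graph. The following sketch assumes NumPy and a hypothetical five-vertex graph made of a triangle and a separate edge; the tolerance `tol` used to decide which eigenvalues count as zero is an assumption of the sketch.

```python
import numpy as np

# Hypothetical graph with two components: a triangle {0, 1, 2} and an edge {3, 4}.
W = np.zeros((5, 5))
for i, j, w in [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 0.5), (3, 4, 1.5)]:
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)      # ascending eigenvalues

tol = 1e-10                               # numerical threshold for "zero"
num_zero = int((np.abs(eigvals) < tol).sum())

# The numerical eigenvectors form some orthonormal basis of the eigenspace of 0,
# not necessarily the indicator vectors themselves. Projecting the indicator
# vectors onto that eigenspace should leave them unchanged.
null_basis = eigvecs[:, :num_zero]
indicators = np.array([[1, 1, 1, 0, 0],
                       [0, 0, 0, 1, 1]], dtype=float).T
projected = null_basis @ (null_basis.T @ indicators)

print("multiplicity of eigenvalue 0:", num_zero)
```

The projection step matters in practice: an eigensolver may return any rotation of the indicator vectors, which is one reason the algorithms of Section 4 cluster the rows of the eigenvector matrix instead of reading the eigenvectors off directly.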



3.2 The normalized graph Laplacians 

There are two matrices which are called normalized graph Laplacians in the literature. Both matrices
are closely related to each other and are defined as

$$L_{\mathrm{sym}} := D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$$
$$L_{\mathrm{rw}} := D^{-1} L = I - D^{-1} W.$$

We denote the first matrix by $L_{\mathrm{sym}}$ as it is a symmetric matrix, and the second one by $L_{\mathrm{rw}}$ as it is
closely related to a random walk. In the following we summarize several properties of $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$.
The standard reference for normalized graph Laplacians is Chung (1997).



Proposition 3 (Properties of $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$) The normalized Laplacians satisfy the following
properties:

1. For every $f \in \mathbb{R}^n$ we have

$$f' L_{\mathrm{sym}} f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2.$$

2. $\lambda$ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector $u$ if and only if $\lambda$ is an eigenvalue of $L_{\mathrm{sym}}$ with
eigenvector $w = D^{1/2} u$.

3. $\lambda$ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector $u$ if and only if $\lambda$ and $u$ solve the generalized
eigenproblem $Lu = \lambda D u$.

4. 0 is an eigenvalue of $L_{\mathrm{rw}}$ with the constant one vector $\mathbf{1}$ as eigenvector. 0 is an eigenvalue of
$L_{\mathrm{sym}}$ with eigenvector $D^{1/2} \mathbf{1}$.

5. $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$ are positive semi-definite and have $n$ non-negative real-valued eigenvalues
$0 = \lambda_1 \leq \ldots \leq \lambda_n$.

Proof. Part (1) can be proved similarly to Part (1) of Proposition 1.

Part (2) can be seen immediately by multiplying the eigenvalue equation $L_{\mathrm{sym}} w = \lambda w$ with $D^{-1/2}$
from the left and substituting $u = D^{-1/2} w$.

Part (3) follows directly by multiplying the eigenvalue equation $L_{\mathrm{rw}} u = \lambda u$ with $D$ from the left.

Part (4): The first statement is obvious as $L_{\mathrm{rw}} \mathbf{1} = 0$, the second statement follows from (2).

Part (5): The statement about $L_{\mathrm{sym}}$ follows from (1), and then the statement about $L_{\mathrm{rw}}$ follows from
(2). □

As it is the case for the unnormalized graph Laplacian, the multiplicity of the eigenvalue 0 of the
normalized graph Laplacian is related to the number of connected components:

Proposition 4 (Number of connected components and spectra of $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$) Let $G$ be
an undirected graph with non-negative weights. Then the multiplicity $k$ of the eigenvalue 0 of both $L_{\mathrm{rw}}$
and $L_{\mathrm{sym}}$ equals the number of connected components $A_1, \ldots, A_k$ in the graph. For $L_{\mathrm{rw}}$, the eigenspace
of 0 is spanned by the indicator vectors $\mathbf{1}_{A_i}$ of those components. For $L_{\mathrm{sym}}$, the eigenspace of 0 is
spanned by the vectors $D^{1/2} \mathbf{1}_{A_i}$.

Proof. The proof is analogous to the one of Proposition 2, using Proposition 3. □
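The relations between the two normalized Laplacians and the generalized eigenproblem can be checked numerically. The sketch below assumes NumPy and a hypothetical graph of two triangles joined by one weak edge, in which every vertex has positive degree so that $D^{-1/2}$ exists. It computes eigenvectors of $L_{\mathrm{sym}}$ with a symmetric solver and maps them to eigenvectors of $L_{\mathrm{rw}}$ via $u = D^{-1/2} w$.

```python
import numpy as np

# Hypothetical connected graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.01)]:
    W[i, j] = W[j, i] = w

d = W.sum(axis=1)                         # all degrees are positive here
D = np.diag(d)
L = D - W
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

L_sym = D_inv_sqrt @ L @ D_inv_sqrt
L_rw = np.diag(1.0 / d) @ L

# Eigenpairs of the symmetric matrix L_sym (columns of W_vecs).
lam, W_vecs = np.linalg.eigh(L_sym)

# Proposition 3, Part (2): u = D^{-1/2} w is an eigenvector of L_rw.
U = D_inv_sqrt @ W_vecs

# Residuals of L_rw u = lambda u and of L u = lambda D u, over all columns.
rw_residual = np.abs(L_rw @ U - U * lam).max()
gen_residual = np.abs(L @ U - (D @ U) * lam).max()

print("max residual for L_rw:", rw_residual)
print("max residual for L u = lambda D u:", gen_residual)
```

Going through $L_{\mathrm{sym}}$ is a design choice of this sketch: it lets a symmetric eigensolver do the work, whereas $L_{\mathrm{rw}}$ itself is not symmetric in general.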



4 Spectral Clustering Algorithms 

Now we would like to state the most common spectral clustering algorithms. For references and the
history of spectral clustering we refer to Section 9. We assume that our data consists of $n$ "points"
$x_1, \ldots, x_n$ which can be arbitrary objects. We measure their pairwise similarities $s_{ij} = s(x_i, x_j)$
by some similarity function which is symmetric and non-negative, and we denote the corresponding
similarity matrix by $S = (s_{ij})_{i,j=1,\ldots,n}$.



Unnormalized spectral clustering

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$
be its weighted adjacency matrix.

• Compute the unnormalized Laplacian $L$.

• Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of $L$.

• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.

• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $U$.

• Cluster the points $(y_i)_{i=1,\ldots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters
$C_1, \ldots, C_k$.

Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{j \mid y_j \in C_i\}$.
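The steps of unnormalized spectral clustering can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes NumPy, takes the weighted adjacency matrix $W$ as given, and uses a small hand-written k-means (Lloyd iterations with a deterministic farthest-point start, a choice made here only to keep the example reproducible). The function names and the example graph are hypothetical.

```python
import numpy as np


def kmeans(Y, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def unnormalized_spectral_clustering(W, k):
    L = np.diag(W.sum(axis=1)) - W        # unnormalized Laplacian L = D - W
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    U = eigvecs[:, :k]                    # first k eigenvectors as columns
    return kmeans(U, k)                   # cluster the rows y_i of U


# Hypothetical example: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.01)]:
    W[i, j] = W[j, i] = w

labels = unnormalized_spectral_clustering(W, k=2)
print(labels)
```

For larger graphs one would typically replace the dense `eigh` call by a sparse eigensolver that computes only the first $k$ eigenvectors, and the hand-written k-means by a library routine.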



There are two different versions of normalized spectral clustering, depending on which of the normalized
graph Laplacians is used. We name both algorithms after two popular papers; for more references and
history please see Section 9.



Normalized spectral clustering according to Shi and Malik (2000)

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$
be its weighted adjacency matrix.

• Compute the unnormalized Laplacian $L$.

• Compute the first $k$ generalized eigenvectors $u_1, \ldots, u_k$ of the generalized
eigenproblem $Lu = \lambda D u$.

• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.

• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $U$.

• Cluster the points $(y_i)_{i=1,\ldots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters
$C_1, \ldots, C_k$.

Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{j \mid y_j \in C_i\}$.



Note that this algorithm uses the generalized eigenvectors of $L$, which according to Proposition 3
correspond to the eigenvectors of the matrix $L_{\mathrm{rw}}$. So in fact, the algorithm works with eigenvectors of
the normalized Laplacian $L_{\mathrm{rw}}$, and hence is called normalized spectral clustering. The next algorithm
also uses a normalized Laplacian, but this time the matrix $L_{\mathrm{sym}}$ instead of $L_{\mathrm{rw}}$. As we will see, this
algorithm needs to introduce an additional row normalization step which is not needed in the other
algorithms. The reasons will become clear in Section 7.



Normalized spectral clustering according to Ng, Jordan, and Weiss (2002)

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$
be its weighted adjacency matrix.

• Compute the normalized Laplacian $L_{\mathrm{sym}}$.

• Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of $L_{\mathrm{sym}}$.

• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.

• Form the matrix $T \in \mathbb{R}^{n \times k}$ from $U$ by normalizing the rows to norm 1,
that is set $t_{ij} = u_{ij} / (\sum_k u_{ik}^2)^{1/2}$.

• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $T$.

• Cluster the points $(y_i)_{i=1,\ldots,n}$ with the k-means algorithm into clusters $C_1, \ldots, C_k$.

Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{j \mid y_j \in C_i\}$.
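Both normalized variants can be sketched with one helper that differs only in the embedding step. The code below is a sketch under assumptions: NumPy, a given adjacency matrix with strictly positive degrees, non-zero rows of $U$ in the row normalization, and the same small deterministic k-means as a stand-in for any k-means routine. The generalized eigenvectors of $Lu = \lambda D u$ are obtained as $D^{-1/2} w$ from eigenvectors $w$ of $L_{\mathrm{sym}}$ (Proposition 3); the names `variant`, `"rw"` and `"sym"` are hypothetical.

```python
import numpy as np


def kmeans(Y, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def normalized_spectral_clustering(W, k, variant="rw"):
    d = W.sum(axis=1)                      # assumes every degree is positive
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.diag(d) - W
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :k]                     # first k eigenvectors of L_sym
    if variant == "rw":
        # Shi and Malik: generalized eigenvectors u = D^{-1/2} w of L u = lambda D u.
        Y = d_inv_sqrt[:, None] * U
    else:
        # Ng, Jordan, and Weiss: normalize each row of U to norm 1
        # (assumes that no row of U is the zero vector).
        Y = U / np.linalg.norm(U, axis=1, keepdims=True)
    return kmeans(Y, k)


# Hypothetical example: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.01)]:
    W[i, j] = W[j, i] = w

labels_rw = normalized_spectral_clustering(W, k=2, variant="rw")
labels_sym = normalized_spectral_clustering(W, k=2, variant="sym")
print(labels_rw, labels_sym)
```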



All three algorithms stated above look rather similar, apart from the fact that they use three different
graph Laplacians. In all three algorithms, the main trick is to change the representation of the abstract
data points $x_i$ to points $y_i \in \mathbb{R}^k$. It is due to the properties of the graph Laplacians that this change of
representation is useful. We will see in the next sections that this change of representation enhances
the cluster-properties in the data, so that clusters can be trivially detected in the new representation.
In particular, the simple k-means clustering algorithm has no difficulties detecting the clusters in this
new representation. Readers not familiar with k-means can read up on this algorithm in numerous
text books, for example in Hastie, Tibshirani, and Friedman (2001).

[Figure 1 plots not reproduced. Upper left panel: "Histogram of the sample". Each of the four rows: a plot
of the eigenvalues followed by plots of Eigenvectors 1 to 5.]

Figure 1: Toy example for spectral clustering where the data points have been drawn from a mixture of
four Gaussians on $\mathbb{R}$. Left upper corner: histogram of the data. First and second row: eigenvalues and
eigenvectors of $L_{\mathrm{rw}}$ and $L$ based on the $k$-nearest neighbor graph. Third and fourth row: eigenvalues
and eigenvectors of $L_{\mathrm{rw}}$ and $L$ based on the fully connected graph. For all plots, we used the Gaussian
kernel with $\sigma = 1$ as similarity function. See text for more details.

Before we dive into the theory of spectral clustering, we would like to illustrate its principle on a very
simple toy example. This example will be used at several places in this tutorial, and we chose it because
it is so simple that the relevant quantities can easily be plotted. This toy data set consists of a random
sample of 200 points $x_1, \ldots, x_{200} \in \mathbb{R}$ drawn according to a mixture of four Gaussians. The first row
of Figure 1 shows the histogram of a sample drawn from this distribution (the x-axis represents the
one-dimensional data space). As similarity function on this data set we choose the Gaussian similarity
function $s(x_i, x_j) = \exp(-|x_i - x_j|^2 / (2\sigma^2))$ with $\sigma = 1$. As similarity graph we consider both the
fully connected graph and the 10-nearest neighbor graph. In Figure 1 we show the first eigenvalues
and eigenvectors of the unnormalized Laplacian $L$ and the normalized Laplacian $L_{\mathrm{rw}}$. That is, in the
eigenvalue plot we plot $i$ vs. $\lambda_i$ (for the moment ignore the dashed line and the different shapes of the
eigenvalues in the plots for the unnormalized case; their meaning will be discussed in Section 8.5). In
the eigenvector plots of an eigenvector $u = (u_1, \ldots, u_{200})'$ we plot $x_i$ vs. $u_i$ (note that in the example
chosen $x_i$ is simply a real number, hence we can depict it on the x-axis). The first two rows of Figure
1 show the results based on the 10-nearest neighbor graph. We can see that the first four eigenvalues
are 0, and the corresponding eigenvectors are cluster indicator vectors. The reason is that the clusters
form disconnected parts in the 10-nearest neighbor graph, in which case the eigenvectors are given as
in Propositions 2 and 4. The next two rows show the results for the fully connected graph. As the
Gaussian similarity function is always positive, this graph only consists of one connected component.
Thus, eigenvalue 0 has multiplicity 1, and the first eigenvector is the constant vector. The following
eigenvectors carry the information about the clusters. For example in the unnormalized case (last
row), if we threshold the second eigenvector at 0, then the part below 0 corresponds to clusters 1 and
2, and the part above 0 to clusters 3 and 4. Similarly, thresholding the third eigenvector separates
clusters 1 and 4 from clusters 2 and 3, and thresholding the fourth eigenvector separates clusters 1
and 3 from clusters 2 and 4. Altogether, the first four eigenvectors carry all the information about the
four clusters. In all the cases illustrated in this figure, spectral clustering using k-means on the first
four eigenvectors easily detects the correct four clusters.



5 Graph cut point of view 

The intuition of clustering is to separate points in different groups according to their similarities. For
data given in form of a similarity graph, this problem can be restated as follows: we want to find a
partition of the graph such that the edges between different groups have a very low weight (which means
that points in different clusters are dissimilar from each other) and the edges within a group have high
weight (which means that points within the same cluster are similar to each other). In this section we
will see how spectral clustering can be derived as an approximation to such graph partitioning problems.

Given a similarity graph with adjacency matrix $W$, the simplest and most direct way to construct
a partition of the graph is to solve the mincut problem. To define it, please recall the notation
$W(A, B) := \sum_{i \in A,\, j \in B} w_{ij}$ and $\bar{A}$ for the complement of $A$. For a given number $k$ of subsets, the
mincut approach simply consists in choosing a partition $A_1, \ldots, A_k$ which minimizes

$$\mathrm{cut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i).$$

Here we introduce the factor 1/2 for notational consistency, otherwise we would count each edge twice
in the cut. In particular for $k = 2$, mincut is a relatively easy problem and can be solved efficiently,
see Stoer and Wagner (1997) and the discussion therein. However, in practice it often does not lead
to satisfactory partitions. The problem is that in many cases, the solution of mincut simply separates
one individual vertex from the rest of the graph. Of course this is not what we want to achieve in
clustering, as clusters should be reasonably large groups of points. One way to circumvent this problem
is to explicitly request that the sets $A_1, \ldots, A_k$ are "reasonably large". The two most common objective
functions to encode this are RatioCut (Hagen and Kahng, 1992) and the normalized cut Ncut (Shi
and Malik, 2000). In RatioCut, the size of a subset $A$ of a graph is measured by its number of vertices
$|A|$, while in Ncut the size is measured by the weights of its edges $\mathrm{vol}(A)$. The definitions are:

$$\mathrm{RatioCut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|} = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{|A_i|}$$

$$\mathrm{Ncut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)} = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)}.$$

Note that both objective functions take a small value if the clusters $A_i$ are not too small. In
particular, the minimum of the function $\sum_{i=1}^{k} (1/|A_i|)$ is achieved if all $|A_i|$ coincide, and the minimum of
$\sum_{i=1}^{k} (1/\mathrm{vol}(A_i))$ is achieved if all $\mathrm{vol}(A_i)$ coincide. So what both objective functions try to achieve is
that the clusters are "balanced", as measured by the number of vertices or edge weights, respectively.
Unfortunately, introducing balancing conditions makes the previously simple to solve mincut problem
become NP hard, see Wagner and Wagner (1993) for a discussion. Spectral clustering is a way to
solve relaxed versions of those problems. We will see that relaxing Ncut leads to normalized spectral
clustering, while relaxing RatioCut leads to unnormalized spectral clustering (see also the tutorial
slides by Ding (2004)).
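The difference between mincut and the balanced objectives can be made concrete on a toy graph. The sketch below assumes NumPy and a hypothetical seven-vertex graph (two triangles joined by an edge, plus one pendant vertex attached by a light edge). As an assumption of this sketch, RatioCut and Ncut are evaluated in the form $\sum_i W(A_i, \bar{A}_i)/|A_i|$ and $\sum_i W(A_i, \bar{A}_i)/\mathrm{vol}(A_i)$, that is, with $\mathrm{cut}(A_i, \bar{A}_i) = W(A_i, \bar{A}_i)$; this is the normalization under which the identities of Sections 5.1 to 5.3 hold.

```python
import numpy as np


def complement(n, A):
    members = set(A)
    return [i for i in range(n) if i not in members]


def cut(W, parts):
    """cut(A_1, ..., A_k) = 1/2 * sum_i W(A_i, complement of A_i)."""
    n = len(W)
    return 0.5 * sum(W[np.ix_(A, complement(n, A))].sum() for A in parts)


def ratio_cut(W, parts):
    """Sum over i of W(A_i, complement) / |A_i| (normalization assumed here)."""
    n = len(W)
    return sum(W[np.ix_(A, complement(n, A))].sum() / len(A) for A in parts)


def ncut(W, parts):
    """Sum over i of W(A_i, complement) / vol(A_i) (normalization assumed here)."""
    n = len(W)
    d = W.sum(axis=1)
    return sum(W[np.ix_(A, complement(n, A))].sum() / d[A].sum() for A in parts)


# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3 (weight 0.5),
# plus a pendant vertex 6 attached to vertex 0 by a light edge (weight 0.3).
W = np.zeros((7, 7))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.5), (0, 6, 0.3)]:
    W[i, j] = W[j, i] = w

balanced = [[0, 1, 2, 6], [3, 4, 5]]
singleton = [[6], [0, 1, 2, 3, 4, 5]]

for name, parts in [("balanced", balanced), ("singleton", singleton)]:
    print(name, cut(W, parts), ratio_cut(W, parts), ncut(W, parts))
```

On this graph the plain cut prefers splitting off the single pendant vertex, while both RatioCut and Ncut prefer the balanced partition, which is the behavior described in the text.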



5.1 Approximating RatioCut for k = 2 

Let us start with the case of RatioCut and $k = 2$, because the relaxation is easiest to understand in
this setting. Our goal is to solve the optimization problem

$$\min_{A \subset V} \mathrm{RatioCut}(A, \bar{A}). \tag{1}$$

We first rewrite the problem in a more convenient form. Given a subset $A \subset V$ we define the vector
$f = (f_1, \ldots, f_n)' \in \mathbb{R}^n$ with entries

$$f_i = \begin{cases} \sqrt{|\bar{A}| / |A|} & \text{if } v_i \in A \\ -\sqrt{|A| / |\bar{A}|} & \text{if } v_i \in \bar{A}. \end{cases} \tag{2}$$



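Now the RatioCut objective function can be conveniently rewritten using the unnormalized graph
Laplacian. This is due to the following calculation:

$$f'Lf = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$$
$$= \frac{1}{2} \sum_{i \in A,\, j \in \bar{A}} w_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2 + \frac{1}{2} \sum_{i \in \bar{A},\, j \in A} w_{ij} \left( -\sqrt{\frac{|\bar{A}|}{|A|}} - \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2$$
$$= \mathrm{cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right)$$
$$= \mathrm{cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right)$$
$$= |V| \cdot \mathrm{RatioCut}(A, \bar{A}).$$

Additionally, we have

$$\sum_{i=1}^{n} f_i = \sum_{i \in A} \sqrt{\frac{|\bar{A}|}{|A|}} - \sum_{i \in \bar{A}} \sqrt{\frac{|A|}{|\bar{A}|}} = |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0.$$

In other words, the vector $f$ as defined in Equation (2) is orthogonal to the constant one vector $\mathbf{1}$.
Finally, note that $f$ satisfies

$$\|f\|^2 = \sum_{i=1}^{n} f_i^2 = |A| \frac{|\bar{A}|}{|A|} + |\bar{A}| \frac{|A|}{|\bar{A}|} = |\bar{A}| + |A| = n.$$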
Altogether we can see that the problem of minimizing (1) can be equivalently rewritten as

$$\min_{A \subset V} f'Lf \quad \text{subject to } f \perp \mathbf{1},\ f_i \text{ as defined in Eq. (2)},\ \|f\| = \sqrt{n}. \tag{3}$$

This is a discrete optimization problem as the entries of the solution vector $f$ are only allowed to take
two particular values, and of course it is still NP hard. The most obvious relaxation in this setting is
to discard the discreteness condition and instead allow that $f_i$ takes arbitrary values in $\mathbb{R}$. This leads
to the relaxed optimization problem

$$\min_{f \in \mathbb{R}^n} f'Lf \quad \text{subject to } f \perp \mathbf{1},\ \|f\| = \sqrt{n}. \tag{4}$$

By the Rayleigh-Ritz theorem (e.g., see Section 5.5.2. of Lütkepohl, 1997) it can be seen immediately
that the solution of this problem is given by the vector $f$ which is the eigenvector corresponding to
the second smallest eigenvalue of $L$ (recall that the smallest eigenvalue of $L$ is 0 with eigenvector $\mathbf{1}$).
So we can approximate a minimizer of RatioCut by the second eigenvector of $L$. However, in order
to obtain a partition of the graph we need to re-transform the real-valued solution vector $f$ of the
relaxed problem into a discrete indicator vector. The simplest way to do this is to use the sign of $f$ as
indicator function, that is to choose

$$v_i \in A \text{ if } f_i \geq 0, \qquad v_i \in \bar{A} \text{ if } f_i < 0.$$

However, in particular in the case of $k > 2$ treated below, this heuristic is too simple. What most
spectral clustering algorithms do instead is to consider the coordinates $f_i$ as points in $\mathbb{R}$ and cluster
them into two groups $C, \bar{C}$ by the k-means clustering algorithm. Then we carry over the resulting
clustering to the underlying data points, that is we choose

$$v_i \in A \text{ if } f_i \in C, \qquad v_i \in \bar{A} \text{ if } f_i \in \bar{C}.$$

This is exactly the unnormalized spectral clustering algorithm for the case of $k = 2$.
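The identities behind the relaxation, and the sign heuristic itself, can be checked on a small example. The sketch below assumes NumPy and a hypothetical graph of two triangles joined by a weak edge; RatioCut is evaluated as $W(A, \bar{A})(1/|A| + 1/|\bar{A}|)$, which is the form the calculation above arrives at.

```python
import numpy as np

# Hypothetical graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.01)]:
    W[i, j] = W[j, i] = w
n = len(W)
L = np.diag(W.sum(axis=1)) - W

# Discrete vector f of Eq. (2) for the partition A = {0,1,2}, A_bar = {3,4,5}.
A, A_bar = [0, 1, 2], [3, 4, 5]
f = np.empty(n)
f[A] = np.sqrt(len(A_bar) / len(A))
f[A_bar] = -np.sqrt(len(A) / len(A_bar))

ratio_cut = W[np.ix_(A, A_bar)].sum() * (1.0 / len(A) + 1.0 / len(A_bar))
discrete_value = f @ L @ f              # should equal |V| * RatioCut(A, A_bar)

# Relaxed problem (4): second eigenvector of L, rescaled to norm sqrt(n).
eigvals, eigvecs = np.linalg.eigh(L)
f_relaxed = eigvecs[:, 1] * np.sqrt(n)
relaxed_value = f_relaxed @ L @ f_relaxed   # equals n * (second eigenvalue)

# Sign heuristic: v_i in A if f_i >= 0, otherwise in the complement.
in_A = f_relaxed >= 0

print("f'Lf for the discrete f:", discrete_value)
print("|V| * RatioCut         :", n * ratio_cut)
print("relaxed optimum        :", relaxed_value)
print("sign partition         :", in_A)
```

Which triangle ends up labeled `True` depends on the arbitrary sign of the eigenvector returned by the solver; only the grouping is meaningful.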



5.2 Approximating RatioCut for arbitrary k

The relaxation of the RatioCut minimization problem in the case of a general value k follows a similar 
principle as the one above. Given a partition of V into k sets Ai, . . . , A^, we define fc indicator vectors 
= {hij,...,hn,jy by 



h3 = Sr. • (z = l,...,n; j = l,...,fc). (5) 

1 otherwise 

Then we set the matrix H G ^nxk ^^ic matrix containing those fc indicator vectors as columns. 
Observe that the columns in H are orthonormal to each other, that is H'H = I. Similar to the 
calculations in the last section we can see that 

^.^^_cut(A^_ 
\Ai\ 

Moreover, one can check that 

hf^Lhi = {H'LH)ii. 

Combining those facts we get 

k k 

RatioCut(Ai, ...,Ak) = ^h[Lh, = Y^{H'LH)ii = Tr{H'LH), 

i=l i=l 
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where Tr denotes the trace of a matrix. So the problem of minimizing RatioCut(^i, . . . ,Ai.) can be 
rewritten as 

min Ti^H'LH) subject to H' H = I, H as defined in Eq. (5). 

Ai,...,Ak 

Similar to above wc now relax the problem by allowing the entries of the matrix H to take arbitrary 
real values. Then the relaxed problem becomes: 

min Ti(H'LH) subject to if'F = /. 

This is the standard form of a trace minimization problem, and again a version of the Rayleigh-Ritz
theorem (e.g., see Section 5.2.2.(6) of Lütkepohl, 1997) tells us that the solution is given by choosing
$H$ as the matrix which contains the first $k$ eigenvectors of $L$ as columns. We can see that the matrix
$H$ is in fact the matrix $U$ used in the unnormalized spectral clustering algorithm as described in
Section 4. Again we need to re-convert the real valued solution matrix to a discrete partition. As
above, the standard way is to use the $k$-means algorithm on the rows of $U$. This leads to the general
unnormalized spectral clustering algorithm as presented in Section 4.
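
The identity RatioCut = Tr(H'LH) that drives this relaxation can be checked numerically on a small graph. The following is a minimal sketch in plain Python; the 5-vertex graph and the partition are hypothetical and chosen only for illustration, and RatioCut is evaluated as the sum of $\mathrm{cut}(A_i, \bar{A}_i)/|A_i|$ with $\mathrm{cut}(A_i, \bar{A}_i) = \sum_{u \in A_i, v \notin A_i} w_{uv}$.

```python
import math

# Hypothetical toy graph (not from the text): symmetric weights on 5 vertices.
W = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = [[0, 1, 2], [3, 4]]  # partition A_1, A_2 (0-based vertex indices)
n, k = len(W), len(clusters)

# Unnormalized Laplacian L = D - W.
deg = [sum(row) for row in W]
L = [[(deg[i] if i == j else 0) - W[i][j] for j in range(n)] for i in range(n)]

# Columns of H as in Eq. (5): h_ij = 1/sqrt(|A_j|) if v_i in A_j, else 0.
cols = []
for A in clusters:
    h = [0.0] * n
    for i in A:
        h[i] = 1.0 / math.sqrt(len(A))
    cols.append(h)


def quad(x, M, y):
    """Return the bilinear form x' M y for list-based vectors and matrices."""
    return sum(x[a] * M[a][b] * y[b] for a in range(len(x)) for b in range(len(y)))


def ratio_cut(W, clusters):
    """Sum over clusters of cut(A_i, complement of A_i) / |A_i|."""
    total = 0.0
    for A in clusters:
        inside = set(A)
        cut = sum(W[i][j] for i in A for j in range(len(W)) if j not in inside)
        total += cut / len(A)
    return total


trace_HLH = sum(quad(h, L, h) for h in cols)  # Tr(H'LH) = sum_i h_i' L h_i
# Gram matrix H'H, which should be the identity.
gram = [[sum(a * b for a, b in zip(cols[p], cols[q])) for q in range(k)] for p in range(k)]

print(trace_HLH, ratio_cut(W, clusters))
```

Both printed numbers agree for this partition, and `gram` is the 2 x 2 identity, which is the constraint $H'H = I$ that survives the relaxation.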



5.3 Approximating Ncut 

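Techniques very similar to the ones used for RatioCut can be used to derive normalized spectral
clustering as relaxation of minimizing Ncut. In the case $k = 2$ we define the cluster indicator vector $f$
by

$$f_i = \begin{cases} \sqrt{\mathrm{vol}(\bar{A})/\mathrm{vol}(A)} & \text{if } v_i \in A \\ -\sqrt{\mathrm{vol}(A)/\mathrm{vol}(\bar{A})} & \text{if } v_i \in \bar{A}. \end{cases} \qquad (6)$$

Similar to above one can check that $(Df)'\mathbb{1} = 0$, $f'Df = \mathrm{vol}(V)$, and $f'Lf = \mathrm{vol}(V)\,\mathrm{Ncut}(A, \bar{A})$. Thus
we can rewrite the problem of minimizing Ncut by the equivalent problem

$$\min_{A} f'Lf \quad \text{subject to } f \text{ as in (6)},\ Df \perp \mathbb{1},\ f'Df = \mathrm{vol}(V). \qquad (7)$$

Again we relax the problem by allowing $f$ to take arbitrary real values:

$$\min_{f \in \mathbb{R}^n} f'Lf \quad \text{subject to } Df \perp \mathbb{1},\ f'Df = \mathrm{vol}(V). \qquad (8)$$

Now we substitute $g := D^{1/2} f$. After substitution, the problem is

$$\min_{g \in \mathbb{R}^n} g' D^{-1/2} L D^{-1/2} g \quad \text{subject to } g \perp D^{1/2}\mathbb{1},\ \|g\|^2 = \mathrm{vol}(V). \qquad (9)$$

Observe that $D^{-1/2} L D^{-1/2} = L_{\mathrm{sym}}$, $D^{1/2}\mathbb{1}$ is the first eigenvector of $L_{\mathrm{sym}}$, and $\mathrm{vol}(V)$ is a constant.
Hence, Problem (9) is in the form of the standard Rayleigh-Ritz theorem, and its solution $g$ is given
by the second eigenvector of $L_{\mathrm{sym}}$. Re-substituting $f = D^{-1/2} g$ and using Proposition 3 we see that
$f$ is the second eigenvector of $L_{\mathrm{rw}}$, or equivalently the generalized eigenvector of $Lu = \lambda Du$.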
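
The three identities used for the case $k = 2$ can be verified directly. The sketch below assumes a small hypothetical weighted graph and the indicator vector $f$ with entries $\sqrt{\mathrm{vol}(\bar{A})/\mathrm{vol}(A)}$ on $A$ and $-\sqrt{\mathrm{vol}(A)/\mathrm{vol}(\bar{A})}$ on $\bar{A}$. It uses the quadratic form $f'Lf = \frac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2$ of the unnormalized Laplacian, and evaluates Ncut as $\mathrm{cut}(A,\bar{A})/\mathrm{vol}(A) + \mathrm{cut}(A,\bar{A})/\mathrm{vol}(\bar{A})$.

```python
import math

# Hypothetical weighted toy graph (an assumption for illustration only).
W = [
    [0.0, 2.0, 1.0, 0.0, 0.0],
    [2.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 3.0],
    [0.0, 0.0, 0.0, 3.0, 0.0],
]
n = len(W)
A = [0, 1, 2]
A_bar = [i for i in range(n) if i not in A]

deg = [sum(row) for row in W]
vol_V = sum(deg)
vol_A = sum(deg[i] for i in A)
vol_A_bar = vol_V - vol_A

# Cluster indicator vector f for the two-cluster Ncut problem.
f = [0.0] * n
for i in A:
    f[i] = math.sqrt(vol_A_bar / vol_A)
for i in A_bar:
    f[i] = -math.sqrt(vol_A / vol_A_bar)

Df_dot_one = sum(deg[i] * f[i] for i in range(n))   # (Df)' 1, expected 0
f_D_f = sum(deg[i] * f[i] ** 2 for i in range(n))   # f' D f, expected vol(V)
# f' L f = 1/2 * sum_ij w_ij (f_i - f_j)^2 for L = D - W.
f_L_f = 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))

cut = sum(W[i][j] for i in A for j in A_bar)
ncut = cut / vol_A + cut / vol_A_bar

print(Df_dot_one, f_D_f, vol_V, f_L_f, vol_V * ncut)
```

Up to rounding, the first value is 0, the second and third coincide, and the last two coincide, matching the three identities stated in the text.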

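For the case of finding $k > 2$ clusters, we define the indicator vectors $h_j = (h_{1,j}, \ldots, h_{n,j})'$ by

$$h_{i,j} = \begin{cases} 1/\sqrt{\mathrm{vol}(A_j)} & \text{if } v_i \in A_j \\ 0 & \text{otherwise} \end{cases} \qquad (i = 1, \ldots, n;\ j = 1, \ldots, k). \qquad (10)$$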

Figure 2: The cockroach graph from Guattery and Miller (1998).

Then we set the matrix $H$ as the matrix containing those $k$ indicator vectors as columns. Observe that
the columns of $H$ are mutually orthogonal, $h_i' D h_i = 1$ (so that $H'DH = I$), and
$h_i' L h_i = \mathrm{cut}(A_i, \bar{A}_i)/\mathrm{vol}(A_i)$. So we can write the problem of minimizing Ncut as

$$\min_{A_1, \ldots, A_k} \mathrm{Tr}(H'LH) \quad \text{subject to } H'DH = I,\ H \text{ as in (10)}.$$

Relaxing the discreteness condition and substituting $T = D^{1/2} H$ we obtain the relaxed problem

$$\min_{T \in \mathbb{R}^{n \times k}} \mathrm{Tr}(T' D^{-1/2} L D^{-1/2} T) \quad \text{subject to } T'T = I. \qquad (11)$$

Again this is the standard trace minimization problem which is solved by the matrix $T$ which contains
the first $k$ eigenvectors of $L_{\mathrm{sym}}$ as columns. Re-substituting $H = D^{-1/2} T$ and using Proposition 3 we
see that the solution $H$ consists of the first $k$ eigenvectors of the matrix $L_{\mathrm{rw}}$, or the first $k$ generalized
eigenvectors of $Lu = \lambda Du$. This yields the normalized spectral clustering algorithm according to Shi
and Malik (2000).



5.4 Comments on the relaxation approach 

There are several comments we should make about this derivation of spectral clustering. Most
importantly, there is no guarantee whatsoever on the quality of the solution of the relaxed problem
compared to the exact solution. That is, if $A_1, \ldots, A_k$ is the exact solution of minimizing RatioCut, and
$B_1, \ldots, B_k$ is the solution constructed by unnormalized spectral clustering, then
$\mathrm{RatioCut}(B_1, \ldots, B_k) - \mathrm{RatioCut}(A_1, \ldots, A_k)$ can be arbitrarily large. Several examples for this can be found in Guattery
and Miller (1998). For instance, the authors consider a very simple class of graphs called "cockroach
graphs". Those graphs essentially look like a ladder, with a few rungs removed, see Figure 2.
Obviously, the ideal RatioCut for $k = 2$ just cuts the ladder by a vertical cut such that
$A = \{v_1, \ldots, v_k, v_{2k+1}, \ldots, v_{3k}\}$ and $\bar{A} = \{v_{k+1}, \ldots, v_{2k}, v_{3k+1}, \ldots, v_{4k}\}$. This cut is perfectly balanced
with $|A| = |\bar{A}| = 2k$ and $\mathrm{cut}(A, \bar{A}) = 2$. However, by studying the properties of the second eigenvector
of the unnormalized graph Laplacian of cockroach graphs the authors prove that unnormalized spectral
clustering always cuts horizontally through the ladder, constructing the sets $B = \{v_1, \ldots, v_{2k}\}$ and
$\bar{B} = \{v_{2k+1}, \ldots, v_{4k}\}$. This also results in a balanced cut, but now we cut $k$ edges instead of just 2.
So $\mathrm{RatioCut}(A, \bar{A}) = 2/k$, while $\mathrm{RatioCut}(B, \bar{B}) = 1$. This means that compared to the optimal cut,
the RatioCut value obtained by spectral clustering is $k/2$ times worse, that is a factor in the order of
$n$. Several other papers investigate the quality of the clustering constructed by spectral clustering, for
example Spielman and Teng (1996) (for unnormalized spectral clustering) and Kannan, Vempala, and
Vetta (2004) (for normalized spectral clustering). In general it is known that efficient algorithms to
approximate balanced graph cuts up to a constant factor do not exist. To the contrary, this
approximation problem can be NP-hard itself (Bui and Jones, 1992).

Of course, the relaxation we discussed above is not unique. For example, a completely different
relaxation which leads to a semi-definite program is derived in Bie and Cristianini (2006), and there might
be many other useful relaxations. The reason why the spectral relaxation is so appealing is not that
it leads to particularly good solutions. Its popularity is mainly due to the fact that it results in a
standard linear algebra problem which is simple to solve.
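
The RatioCut values quoted for the cockroach graph can be reproduced by counting cut edges. The sketch below is an illustration under stated assumptions: the edge layout (a top path $v_1, \ldots, v_{2k}$, a bottom path $v_{2k+1}, \ldots, v_{4k}$, and $k$ rungs joining $v_{k+i}$ to $v_{3k+i}$) is read off Figure 2, all edges have unit weight, and the function names are hypothetical. It only evaluates the two cuts discussed in the text and does not run spectral clustering itself.

```python
def cockroach_edges(k):
    """Unit-weight edges of a cockroach graph on vertices 1..4k (assumed layout)."""
    edges = []
    for i in range(1, 2 * k):
        edges.append((i, i + 1))                  # top path v_1 .. v_2k
        edges.append((2 * k + i, 2 * k + i + 1))  # bottom path v_2k+1 .. v_4k
    for i in range(1, k + 1):
        edges.append((k + i, 3 * k + i))          # rungs on the right half
    return edges


def ratio_cut_two(edges, part, n):
    """RatioCut of the bipartition (part, complement) for unit weights."""
    part = set(part)
    cut = sum(1 for (u, v) in edges if (u in part) != (v in part))
    return cut / len(part) + cut / (n - len(part))


k = 10
n = 4 * k
edges = cockroach_edges(k)

A = list(range(1, k + 1)) + list(range(2 * k + 1, 3 * k + 1))  # vertical cut
B = list(range(1, 2 * k + 1))                                  # horizontal cut

rc_A = ratio_cut_two(edges, A, n)  # 2 cut edges, both sides of size 2k
rc_B = ratio_cut_two(edges, B, n)  # k cut edges, both sides of size 2k
print(rc_A, rc_B, rc_B / rc_A)
```

For $k = 10$ the vertical cut gives $2/k = 0.2$ and the horizontal cut gives 1, so the horizontal cut is worse by the factor $k/2 = 5$, as stated above.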



6 Random walks point of view 

Another line of argument to explain spectral clustering is based on random walks on the similarity 
graph. A random walk on a graph is a stochastic process which randomly jumps from vertex to vertex. 
We will see below that spectral clustering can be interpreted as trying to find a partition of the graph 
such that the random walk stays long within the same cluster and seldom jumps between clusters. 
Intuitively this makes sense, in particular together with the graph cut explanation of the last section: 
a balanced partition with a low cut will also have the property that the random walk does not have 
many opportunities to jump between clusters. For background reading on random walks in general we 
refer to Norris (1997) and Brémaud (1999), and for random walks on graphs we recommend Aldous
and Fill (in preparation) and Lovász (1993). Formally, the transition probability of jumping in one
step from vertex $v_i$ to vertex $v_j$ is proportional to the edge weight $w_{ij}$ and is given by $p_{ij} := w_{ij}/d_i$.
The transition matrix $P = (p_{ij})_{i,j=1,\ldots,n}$ of the random walk is thus defined by

$$P = D^{-1} W.$$

If the graph is connected and non-bipartite, then the random walk always possesses a unique stationary
distribution $\pi = (\pi_1, \ldots, \pi_n)'$, where $\pi_i = d_i/\mathrm{vol}(V)$. Obviously there is a tight relationship between
$L_{\mathrm{rw}}$ and $P$, as $L_{\mathrm{rw}} = I - P$. As a consequence, $\lambda$ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector $u$ if and only
if $1 - \lambda$ is an eigenvalue of $P$ with eigenvector $u$. It is well known that many properties of a graph can
be expressed in terms of the corresponding random walk transition matrix $P$, see Lovász (1993) for
an overview. From this point of view it does not come as a surprise that the largest eigenvectors of $P$
and the smallest eigenvectors of $L_{\mathrm{rw}}$ can be used to describe cluster properties of the graph.

Random walks and Ncut

A formal equivalence between Ncut and transition probabilities of the random walk has been observed
in Meila and Shi (2001).

Proposition 5 (Ncut via transition probabilities) Let $G$ be connected and non-bipartite. Assume
that we run the random walk $(X_t)_{t \in \mathbb{N}}$ starting with $X_0$ in the stationary distribution $\pi$. For
disjoint subsets $A, B \subset V$, denote by $P(B|A) := P(X_1 \in B \mid X_0 \in A)$. Then:

$$\mathrm{Ncut}(A, \bar{A}) = P(\bar{A}|A) + P(A|\bar{A}).$$

Proof. First of all observe that

$$P(X_0 \in A, X_1 \in B) = \sum_{i \in A, j \in B} P(X_0 = i, X_1 = j) = \sum_{i \in A, j \in B} \pi_i p_{ij} = \sum_{i \in A, j \in B} \frac{d_i}{\mathrm{vol}(V)} \frac{w_{ij}}{d_i} = \frac{1}{\mathrm{vol}(V)} \sum_{i \in A, j \in B} w_{ij}.$$

Using this we obtain

$$P(X_1 \in B \mid X_0 \in A) = \frac{P(X_0 \in A, X_1 \in B)}{P(X_0 \in A)} = \left( \frac{1}{\mathrm{vol}(V)} \sum_{i \in A, j \in B} w_{ij} \right) \left( \frac{\mathrm{vol}(A)}{\mathrm{vol}(V)} \right)^{-1} = \frac{\sum_{i \in A, j \in B} w_{ij}}{\mathrm{vol}(A)}.$$

Now the proposition follows directly with the definition of Ncut. □



This proposition leads to a nice interpretation of Ncut, and hence of normalized spectral clustering. It
tells us that when minimizing Ncut, we actually look for a cut through the graph such that a random
walk seldom transitions from $A$ to $\bar{A}$ and vice versa.
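
Proposition 5 and the stationary distribution can both be checked by direct computation. The sketch below assumes a small hypothetical graph that is connected and non-bipartite (it contains a triangle). It builds $P = D^{-1}W$ and $\pi_i = d_i/\mathrm{vol}(V)$, verifies $\pi' P = \pi'$, and compares Ncut, evaluated as $\mathrm{cut}(A,\bar{A})/\mathrm{vol}(A) + \mathrm{cut}(A,\bar{A})/\mathrm{vol}(\bar{A})$, with the sum of the two one-step transition probabilities.

```python
# Hypothetical connected, non-bipartite toy graph (an assumption for illustration).
W = [
    [0.0, 2.0, 1.0, 0.0, 0.0],
    [2.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 3.0],
    [0.0, 0.0, 0.0, 3.0, 0.0],
]
n = len(W)
deg = [sum(row) for row in W]
vol_V = sum(deg)

P = [[W[i][j] / deg[i] for j in range(n)] for i in range(n)]  # P = D^{-1} W
pi = [d / vol_V for d in deg]                                  # pi_i = d_i / vol(V)

# Stationarity check: the j-th entry of pi' P.
pi_P = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def transition_prob(src, dst):
    """P(X_1 in dst | X_0 in src) when X_0 is drawn from pi."""
    joint = sum(pi[i] * P[i][j] for i in src for j in dst)
    return joint / sum(pi[i] for i in src)


A = [0, 1, 2]
A_bar = [3, 4]
cut = sum(W[i][j] for i in A for j in A_bar)
ncut = cut / sum(deg[i] for i in A) + cut / sum(deg[i] for i in A_bar)
escape = transition_prob(A, A_bar) + transition_prob(A_bar, A)

print(ncut, escape)
```

The two printed values coincide, so a partition with small Ncut is one that the random walk rarely leaves in a single step.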

The commute distance

A second connection between random walks and graph Laplacians can be made via the commute
distance on the graph. The commute distance (also called resistance distance) $c_{ij}$ between two vertices
$v_i$ and $v_j$ is the expected time it takes the random walk to travel from vertex $v_i$ to vertex $v_j$ and back
(Lovász, 1993; Aldous and Fill, in preparation). The commute distance has several nice properties
which make it particularly appealing for machine learning. As opposed to the shortest path distance
on a graph, the commute distance between two vertices decreases if there are many different short ways
to get from vertex $v_i$ to vertex $v_j$. So instead of just looking for the one shortest path, the commute
distance looks at the set of short paths. Points which are connected by a short path in the graph and
lie in the same high-density region of the graph are considered closer to each other than points which
are connected by a short path but lie in different high-density regions of the graph. In this sense, the
commute distance seems particularly well-suited to be used for clustering purposes.

Remarkably, the commute distance on a graph can be computed with the help of the generalized inverse
(also called pseudo-inverse or Moore-Penrose inverse) of the graph Laplacian $L$. In the following we
denote $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)'$ as the i-th unit vector. To define the generalized inverse of $L$, recall
that by Proposition 1 the matrix $L$ can be decomposed as $L = U \Lambda U'$ where $U$ is the matrix containing
all eigenvectors as columns and $\Lambda$ the diagonal matrix with the eigenvalues $\lambda_1, \ldots, \lambda_n$ on the diagonal.
As at least one of the eigenvalues is 0, the matrix $L$ is not invertible. Instead, we define its generalized
inverse as $L^\dagger := U \Lambda^\dagger U'$ where the matrix $\Lambda^\dagger$ is the diagonal matrix with diagonal entries $1/\lambda_i$ if $\lambda_i \neq 0$
and 0 if $\lambda_i = 0$. The entries of $L^\dagger$ can be computed as $l^\dagger_{ij} = \sum_{k=2}^{n} \frac{1}{\lambda_k} u_{ik} u_{jk}$. The matrix $L^\dagger$ is positive
semi-definite and symmetric. For further properties of $L^\dagger$ see Gutman and Xiao (2004).

Proposition 6 (Commute distance) Let $G = (V, E)$ be a connected, undirected graph. Denote by $c_{ij}$
the commute distance between vertex $v_i$ and vertex $v_j$, and by $L^\dagger = (l^\dagger_{ij})_{i,j=1,\ldots,n}$ the generalized inverse
of $L$. Then we have:

$$c_{ij} = \mathrm{vol}(V)(l^\dagger_{ii} - 2 l^\dagger_{ij} + l^\dagger_{jj}) = \mathrm{vol}(V)(e_i - e_j)' L^\dagger (e_i - e_j).$$

This result has been published by Klein and Randic (1993), where it has been proved by methods of
electrical network theory. For a proof using first step analysis for random walks see Fouss, Pirotte,
Renders, and Saerens (2007). There also exist other ways to express the commute distance with the help
of graph Laplacians. For example a method in terms of eigenvectors of the normalized Laplacian $L_{\mathrm{sym}}$
can be found as Corollary 3.2 in Lovász (1993), and a method computing the commute distance with
the help of determinants of certain sub-matrices of $L$ can be found in Bapat, Gutman, and Xiao (2003).

Proposition 6 has an important consequence. It shows that $\sqrt{c_{ij}}$ can be considered as a Euclidean
distance function on the vertices of the graph. This means that we can construct an embedding which
maps the vertices $v_i$ of the graph on points $z_i \in \mathbb{R}^n$ such that the Euclidean distances between the
points $z_i$ coincide with the commute distances on the graph. This works as follows. As the matrix
$L^\dagger$ is positive semi-definite and symmetric, it induces an inner product on $\mathbb{R}^n$ (or to be more formal, it
induces an inner product on the subspace of $\mathbb{R}^n$ which is perpendicular to the vector $\mathbb{1}$). Now choose
$z_i$ as the point in $\mathbb{R}^n$ corresponding to the i-th row of the matrix $U(\Lambda^\dagger)^{1/2}$. Then, by Proposition 6
and by the construction of $L^\dagger$ we have that $\langle z_i, z_j \rangle = e_i' L^\dagger e_j$ and $c_{ij} = \mathrm{vol}(V)\|z_i - z_j\|^2$.
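
The formula of Proposition 6 can be tried out on a graph small enough to compute the expected commute times by hand. The sketch below is an assumption-laden illustration, not the construction used in the text: instead of an eigendecomposition it obtains the generalized inverse of a connected graph's Laplacian from the identity $L^\dagger = (L + \frac{1}{n}\mathbb{1}\mathbb{1}')^{-1} - \frac{1}{n}\mathbb{1}\mathbb{1}'$, and it inverts the matrix with a hand-written Gauss-Jordan routine (the helper names are hypothetical).

```python
def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(x) for x in M[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def commute_distances(W):
    """Matrix of commute distances c_ij = vol(V) (l_ii - 2 l_ij + l_jj), l = entries of L^+."""
    n = len(W)
    deg = [sum(row) for row in W]
    vol = sum(deg)
    # For a connected graph, L + (1/n) 11' is invertible and its inverse
    # differs from the pseudo-inverse of L exactly by (1/n) 11'.
    shifted = [[(deg[i] if i == j else 0.0) - W[i][j] + 1.0 / n for j in range(n)]
               for i in range(n)]
    inv = invert(shifted)
    Lp = [[inv[i][j] - 1.0 / n for j in range(n)] for i in range(n)]
    return [[vol * (Lp[i][i] - 2 * Lp[i][j] + Lp[j][j]) for j in range(n)]
            for i in range(n)]


# Path graph v_0 - v_1 - v_2 with unit weights.
W = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
C = commute_distances(W)
print(C[0][1], C[0][2])
```

On this path the random walk needs on average 1 step from $v_0$ to $v_1$ and 3 steps back, and 4 steps in each direction between the two end points, so the expected commute times are 4 and 8. The printed values agree with this up to rounding.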

The embedding used in unnormalized spectral clustering is related to the commute time embedding,
but not identical. In spectral clustering, we map the vertices of the graph on the rows $y_i$ of the matrix
$U$, while the commute time embedding maps the vertices on the rows $z_i$ of the matrix $U(\Lambda^\dagger)^{1/2}$. That
is, compared to the entries of $y_i$, the entries of $z_i$ are additionally scaled by the inverse square roots of
the eigenvalues of $L$. Moreover, in spectral clustering we only take the first $k$ columns of the matrix, while
the commute time embedding takes all columns. Several authors now try to justify why $y_i$ and $z_i$ are not
so different after all and argue, somewhat hand-wavingly, that the fact that spectral clustering constructs
clusters based on the Euclidean distances between the $y_i$ can be interpreted as building clusters of the
vertices in the graph based on the commute distance. However, note that both approaches can differ
considerably. For example, in the optimal case where the graph consists of $k$ disconnected components,
the first $k$ eigenvalues of $L$ are 0 according to Proposition 2, and the first $k$ columns of $U$ consist of
the cluster indicator vectors. However, the first $k$ columns of the matrix $U(\Lambda^\dagger)^{1/2}$ consist of zeros
only, as the first $k$ diagonal elements of $\Lambda^\dagger$ are 0. In this case, the information contained in the first
$k$ columns of $U$ is completely ignored in the matrix $U(\Lambda^\dagger)^{1/2}$, and all the non-zero elements of the
matrix $U(\Lambda^\dagger)^{1/2}$ which can be found in columns $k + 1$ to $n$ are not taken into account in spectral
clustering, which discards all those columns. On the other hand, those problems do not occur if the
underlying graph is connected. In this case, the only eigenvector with eigenvalue 0 is the constant one
vector, which can be ignored in both cases. The eigenvectors corresponding to small eigenvalues $\lambda_i$
of $L$ are then stressed in the matrix $U(\Lambda^\dagger)^{1/2}$ as they are multiplied by $(\lambda_i^\dagger)^{1/2} = 1/\sqrt{\lambda_i}$. In such a
situation, it might be true that the commute time embedding and the spectral embedding do similar things.

All in all, it seems that the commute time distance can be a helpful intuition, but without making 
further assumptions there is only a rather loose relation between spectral clustering and the commute 
distance. It might be possible that those relations can be tightened, for example if the similarity 
function is strictly positive definite. However, we have not yet seen a precise mathematical statement 
about this. 

7 Perturbation theory point of view 

Perturbation theory studies the question of how eigenvalues and eigenvectors of a matrix $A$ change if
we add a small perturbation $H$, that is we consider the perturbed matrix $\tilde{A} := A + H$. Most
perturbation theorems state that a certain distance between eigenvalues or eigenvectors of $A$ and $\tilde{A}$ is bounded
by a constant times a norm of $H$. The constant usually depends on which eigenvalue we are looking
at, and how far this eigenvalue is separated from the rest of the spectrum (for a formal statement see
below). The justification of spectral clustering is then the following: Let us first consider the "ideal
case" where the between-cluster similarity is exactly 0. We have seen in Section 3 that then the first
$k$ eigenvectors of $L$ or $L_{\mathrm{rw}}$ are the indicator vectors of the clusters. In this case, the points $y_i \in \mathbb{R}^k$
constructed in the spectral clustering algorithms have the form $(0, \ldots, 0, 1, 0, \ldots, 0)'$ where the position
of the 1 indicates the connected component this point belongs to. In particular, all $y_i$ belonging to the
same connected component coincide. The $k$-means algorithm will trivially find the correct partition
by placing a center point on each of the points $(0, \ldots, 0, 1, 0, \ldots, 0)' \in \mathbb{R}^k$. In a "nearly ideal case"
where we still have distinct clusters, but the between-cluster similarity is not exactly 0, we consider
the Laplacian matrices to be perturbed versions of the ones of the ideal case. Perturbation theory then
tells us that the eigenvectors will be very close to the ideal indicator vectors. The points $y_i$ might not
completely coincide with $(0, \ldots, 0, 1, 0, \ldots, 0)'$, but do so up to some small error term. Hence, if the
perturbations are not too large, then the $k$-means algorithm will still separate the groups from each other.



7.1 The formal perturbation argument 

The formal basis for the perturbation approach to spectral clustering is the Davis-Kahan theorem from
matrix perturbation theory. This theorem bounds the difference between eigenspaces of symmetric
matrices under perturbations. We state those results for completeness, but for background reading we
refer to Section V of Stewart and Sun (1990) and Section VII.3 of Bhatia (1997). In perturbation theory,
distances between subspaces are usually measured using "canonical angles" (also called "principal
angles"). To define principal angles, let $\mathcal{V}_1$ and $\mathcal{V}_2$ be two $p$-dimensional subspaces of $\mathbb{R}^d$, and $V_1$ and
$V_2$ two matrices such that their columns form orthonormal systems for $\mathcal{V}_1$ and $\mathcal{V}_2$, respectively. Then
the cosines $\cos \Theta_i$ of the principal angles $\Theta_i$ are the singular values of $V_1' V_2$. For $p = 1$, the so defined
canonical angles coincide with the normal definition of an angle. Canonical angles can also be defined
if $\mathcal{V}_1$ and $\mathcal{V}_2$ do not have the same dimension, see Section V of Stewart and Sun (1990), Section VII.3 of
Bhatia (1997), or Section 12.4.3 of Golub and Van Loan (1996). The matrix $\sin \Theta(\mathcal{V}_1, \mathcal{V}_2)$ will denote
the diagonal matrix with the sine of the canonical angles on the diagonal.

Theorem 7 (Davis-Kahan) Let $A, H \in \mathbb{R}^{n \times n}$ be symmetric matrices, and let $\|\cdot\|$ be the Frobenius
norm or the two-norm for matrices, respectively. Consider $\tilde{A} := A + H$ as a perturbed version of $A$.
Let $S_1 \subset \mathbb{R}$ be an interval. Denote by $\sigma_{S_1}(A)$ the set of eigenvalues of $A$ which are contained in $S_1$,
and by $V_1$ the eigenspace corresponding to all those eigenvalues (more formally, $V_1$ is the image of
the spectral projection induced by $\sigma_{S_1}(A)$). Denote by $\sigma_{S_1}(\tilde{A})$ and $\tilde{V}_1$ the analogous quantities for $\tilde{A}$.
Define the distance between $S_1$ and the spectrum of $A$ outside of $S_1$ as

$$\delta = \min\{|\lambda - s|;\ \lambda \text{ eigenvalue of } A,\ \lambda \notin S_1,\ s \in S_1\}.$$

Then the distance $d(V_1, \tilde{V}_1) := \|\sin \Theta(V_1, \tilde{V}_1)\|$ between the two subspaces $V_1$ and $\tilde{V}_1$ is bounded by

$$d(V_1, \tilde{V}_1) \leq \frac{\|H\|}{\delta}.$$

For a discussion and proofs of this theorem see for example Section V.3 of Stewart and Sun (1990).
Let us try to decrypt this theorem, for simplicity in the case of the unnormalized Laplacian (for the
normalized Laplacian it works analogously). The matrix $A$ will correspond to the graph Laplacian
$L$ in the ideal case where the graph has $k$ connected components. The matrix $\tilde{A}$ corresponds to a
perturbed case, where due to noise the $k$ components in the graph are no longer completely
disconnected, but they are only connected by few edges with low weight. We denote the corresponding graph
Laplacian of this case by $\tilde{L}$. For spectral clustering we need to consider the first $k$ eigenvalues and
eigenvectors of $\tilde{L}$. Denote the eigenvalues of $L$ by $\lambda_1, \ldots, \lambda_n$ and the ones of the perturbed Laplacian
$\tilde{L}$ by $\tilde{\lambda}_1, \ldots, \tilde{\lambda}_n$. Choosing the interval $S_1$ is now the crucial point. We want to choose it such that
both the first $k$ eigenvalues of $L$ and the first $k$ eigenvalues of $\tilde{L}$ are contained in $S_1$. This is easier
the smaller the perturbation $H = L - \tilde{L}$ and the larger the eigengap $|\lambda_k - \lambda_{k+1}|$ is. If we manage
to find such a set, then the Davis-Kahan theorem tells us that the eigenspaces corresponding to the
first $k$ eigenvalues of the ideal matrix $L$ and the first $k$ eigenvalues of the perturbed matrix $\tilde{L}$ are very
close to each other, that is their distance is bounded by $\|H\|/\delta$. Then, as the eigenvectors in the ideal
case are piecewise constant on the connected components, the same will approximately be true in the
perturbed case. How good "approximately" is depends on the norm of the perturbation $\|H\|$ and the
distance $\delta$ between $S_1$ and the $(k + 1)$st eigenvalue of $L$. If the set $S_1$ has been chosen as the interval
$[0, \lambda_k]$, then $\delta$ coincides with the spectral gap $|\lambda_{k+1} - \lambda_k|$. We can see from the theorem that the larger
this eigengap is, the closer the eigenvectors of the ideal case and the perturbed case are, and hence the
better spectral clustering works. Below we will see that the size of the eigengap can also be used in a
different context as a quality criterion for spectral clustering, namely when choosing the number $k$ of
clusters to construct.

If the perturbation $H$ is too large or the eigengap is too small, we might not find a set $S_1$ such that
both the first $k$ eigenvalues of $L$ and $\tilde{L}$ are contained in $S_1$. In this case, we need to make a compromise
by choosing the set $S_1$ to contain the first $k$ eigenvalues of $L$, but maybe a few more or less eigenvalues
of $\tilde{L}$. The statement of the theorem then becomes weaker in the sense that either we do not compare
the eigenspaces corresponding to the first $k$ eigenvectors of $L$ and $\tilde{L}$, but the eigenspaces corresponding
to the first $k$ eigenvectors of $L$ and the first $\tilde{k}$ eigenvectors of $\tilde{L}$ (where $\tilde{k}$ is the number of eigenvalues
of $\tilde{L}$ contained in $S_1$). Or, it can happen that $\delta$ becomes so small that the bound on the distance
$d(V_1, \tilde{V}_1)$ blows up so much that it becomes useless.
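
The role of the gap $\delta$ in the Davis-Kahan bound can be seen in the smallest possible example, where the angle between the ideal and the perturbed eigenvector has a closed form. The sketch below is an illustration under assumptions of our own choosing: it uses the $2 \times 2$ matrices $A = \mathrm{diag}(0, \mathrm{gap})$ and $H$ with off-diagonal entries `eps` rather than graph Laplacians, the two-norm (so $\|H\| = $ `eps`), and the interval $S_1 = [\tilde{\lambda}_1, 0]$, for which $\delta = $ `gap`.

```python
import math


def sin_angle_and_bound(gap, eps):
    """Compare sin(theta) with the bound ||H|| / delta for a 2x2 example.

    A = diag(0, gap) is the unperturbed matrix and H = [[0, eps], [eps, 0]]
    the perturbation, so A + H = [[0, eps], [eps, gap]].
    """
    # Smallest eigenvalue of A + H (closed form for a symmetric 2x2 matrix).
    lam = 0.5 * (gap - math.sqrt(gap ** 2 + 4 * eps ** 2))
    # Corresponding eigenvector, from the second row of (A + H - lam I) v = 0.
    v = (gap - lam, -eps)
    norm = math.hypot(v[0], v[1])
    # Residual of both rows of the eigenvector equation, as a sanity check.
    residual = max(abs(eps * v[1] - lam * v[0]),
                   abs(eps * v[0] + (gap - lam) * v[1]))
    # The eigenvector of A for eigenvalue 0 is (1, 0), so sin(theta) = |v_2| / ||v||.
    sin_theta = abs(v[1]) / norm
    # S_1 = [lam, 0] contains the eigenvalue 0 of A and lam of A + H; the only
    # eigenvalue of A outside S_1 is gap, so delta = gap, and ||H||_2 = eps.
    bound = eps / gap
    return sin_theta, bound, residual


large_gap = sin_angle_and_bound(gap=1.0, eps=0.1)
small_gap = sin_angle_and_bound(gap=0.2, eps=0.1)
print(large_gap[:2], small_gap[:2])
```

For the same perturbation size, shrinking the gap from 1.0 to 0.2 rotates the eigenvector much further away from its ideal position, and the bound grows accordingly. This is the behaviour the discussion above describes for the eigengap $|\lambda_{k+1} - \lambda_k|$.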

7.2 Comments about the perturbation approach 

A bit of caution is needed when using perturbation theory arguments to justify clustering algorithms 
based on eigenvectors of matrices. In general, any block diagonal symmetric matrix has the property 
that there exists a basis of eigenvectors which are zero outside the individual blocks and real-valued 
within the blocks. For example, based on this argument several authors use the eigenvectors of the 
similarity matrix S or adjacency matrix W to discover clusters. However, being block diagonal in the 
ideal case of completely separated clusters can be considered as a necessary condition for a successful 
use of eigenvectors, but not a sufficient one. At least two more properties should be satisfied: 

First, we need to make sure that the order of the eigenvalues and eigenvectors is meaningful. In case
of the Laplacians this is always true, as we know that any connected component possesses exactly one
eigenvector which has eigenvalue 0. Hence, if the graph has $k$ connected components and we take the
first $k$ eigenvectors of the Laplacian, then we know that we have exactly one eigenvector per
component. However, this might not be the case for other matrices such as $S$ or $W$. For example, it could be
the case that the two largest eigenvalues of a block diagonal similarity matrix $S$ come from the same
block. In such a situation, if we take the first $k$ eigenvectors of $S$, some blocks will be represented
several times, while there are other blocks which we will miss completely (unless we take certain
precautions). This is the reason why using the eigenvectors of $S$ or $W$ for clustering should be discouraged.
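
A concrete instance of this ordering problem can be written down with eigenpairs that are known in closed form. The sketch below assumes a hypothetical block diagonal similarity matrix $S$ with a $4 \times 4$ block and a $2 \times 2$ block; the parameters `a`, `b`, `c` are illustrative choices. It verifies all six eigenpairs of $S$, shows that the two largest eigenvalues both belong to the first block, and checks that the Laplacian $L = D - S$ has one eigenvector with eigenvalue 0 per block.

```python
a, b, c = 0.9, 0.1, 0.5  # hypothetical similarity values

# Block diagonal similarity matrix: block 1 = vertices 0..3, block 2 = vertices 4..5.
S = [
    [1, a, b, b, 0, 0],
    [a, 1, b, b, 0, 0],
    [b, b, 1, a, 0, 0],
    [b, b, a, 1, 0, 0],
    [0, 0, 0, 0, 1, c],
    [0, 0, 0, 0, c, 1],
]

# Closed-form eigenpairs of S: (eigenvalue, eigenvector, block it lives on).
eigenpairs = [
    (1 + a + 2 * b, [1, 1, 1, 1, 0, 0], 1),
    (1 + a - 2 * b, [1, 1, -1, -1, 0, 0], 1),
    (1 - a, [1, -1, 0, 0, 0, 0], 1),
    (1 - a, [0, 0, 1, -1, 0, 0], 1),
    (1 + c, [0, 0, 0, 0, 1, 1], 2),
    (1 - c, [0, 0, 0, 0, 1, -1], 2),
]


def matvec(M, v):
    """Matrix-vector product for list-based matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Largest deviation of S v from lambda v over all listed eigenpairs.
residuals = [max(abs(y - lam * x) for y, x in zip(matvec(S, v), v))
             for lam, v, _ in eigenpairs]

# Blocks of the two largest eigenvalues of S.
top2_blocks = [blk for _, _, blk in sorted(eigenpairs, key=lambda t: -t[0])[:2]]

# Laplacian L = D - S: each block indicator is an eigenvector with eigenvalue 0.
deg = [sum(row) for row in S]
n = len(S)
L = [[(deg[i] if i == j else 0) - S[i][j] for j in range(n)] for i in range(n)]
null_checks = [matvec(L, ind) for ind in ([1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1])]

print(top2_blocks)
```

Here the top two eigenvectors of $S$ are both supported on the first block, so the second block would be missed entirely, while the first two eigenvectors of the Laplacian can be chosen as one indicator per block.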

The second property is that in the ideal case, the entries of the eigenvectors on the components should
be "safely bounded away" from 0. Assume that an eigenvector on the first connected component has
an entry $u_{1,i} > 0$ at position $i$. In the ideal case, the fact that this entry is non-zero indicates that the
corresponding point $i$ belongs to the first cluster. The other way round, if a point $j$ does not belong to
cluster 1, then in the ideal case it should be the case that $u_{1,j} = 0$. Now consider the same situation,
but with perturbed data. The perturbed eigenvector $\tilde{u}$ will usually not have any zero component
any more; but if the noise is not too large, then perturbation theory tells us that the entries $\tilde{u}_{1,i}$ and
$\tilde{u}_{1,j}$ are still "close" to their original values $u_{1,i}$ and $u_{1,j}$. So both entries $\tilde{u}_{1,i}$ and $\tilde{u}_{1,j}$ will take some
small values, say $\varepsilon_1$ and $\varepsilon_2$. In practice, if those values are very small it is unclear how we should
interpret this situation. Either we believe that small entries in $\tilde{u}$ indicate that the points do not belong
to the first cluster (which then misclassifies the first data point $i$), or we think that the entries already
indicate class membership and classify both points to the first cluster (which misclassifies point $j$).

For both matrices $L$ and $L_{\mathrm{rw}}$, the eigenvectors in the ideal situation are indicator vectors, so the second
problem described above cannot occur. However, this is not true for the matrix $L_{\mathrm{sym}}$, which is used
in the normalized spectral clustering algorithm of Ng et al. (2002). Even in the ideal case, the
eigenvectors of this matrix are given as $D^{1/2}\mathbb{1}_{A_i}$. If the degrees of the vertices differ a lot, and in particular
if there are vertices which have a very low degree, the corresponding entries in the eigenvectors are
very small. To counteract the problem described above, the row-normalization step in the algorithm
of Ng et al. (2002) comes into play. In the ideal case, the matrix $U$ in the algorithm has exactly one
non-zero entry per row. After row-normalization, the matrix $T$ in the algorithm of Ng et al. (2002)
then consists of the cluster indicator vectors. Note however, that this might not always work out
correctly in practice. Assume that we have $\tilde{U}_{i,1} = \varepsilon_1$ and $\tilde{U}_{i,2} = \varepsilon_2$. If we now normalize the i-th row
of $\tilde{U}$, both $\varepsilon_1$ and $\varepsilon_2$ will be multiplied by the factor of $1/\sqrt{\varepsilon_1^2 + \varepsilon_2^2}$ and become rather large. We now
run into a similar problem as described above: both points are likely to be classified into the same
cluster, even though they belong to different clusters. This argument shows that spectral clustering
using the matrix $L_{\mathrm{sym}}$ can be problematic if the eigenvectors contain particularly small entries. On
the other hand, note that such small entries in the eigenvectors only occur if some of the vertices have
a particularly low degree (as the eigenvectors of $L_{\mathrm{sym}}$ are given by $D^{1/2}\mathbb{1}_{A_i}$). One could argue that in
such a case, the data point should be considered an outlier anyway, and then it does not really matter
in which cluster the point will end up.
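
The amplification caused by row-normalization is easy to see numerically. The following sketch uses made-up values for $\varepsilon_1$, $\varepsilon_2$ and for a row of a well-connected vertex; it only illustrates the scaling factor $1/\sqrt{\varepsilon_1^2 + \varepsilon_2^2}$ and is not an implementation of the full algorithm of Ng et al. (2002).

```python
import math


def normalize_row(row):
    """Scale a row to unit Euclidean norm, as in the row-normalization step."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]


eps1, eps2 = 1e-3, 2e-3          # hypothetical small entries in row i
row_low_degree = [eps1, eps2]    # row of a low-degree vertex
row_typical = [0.5, 1e-3]        # hypothetical row of a well-connected vertex

scale = 1.0 / math.sqrt(eps1 ** 2 + eps2 ** 2)
t_low = normalize_row(row_low_degree)
t_typical = normalize_row(row_typical)

print(scale, t_low, t_typical)
```

The typical row stays close to the indicator $(1, 0)$, whereas the low-degree row is blown up by a factor of several hundred and lands at a point whose position is decided entirely by the ratio of the two small, noise-dominated entries.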

To summarize, the conclusion is that both unnormalized spectral clustering and normalized spectral
clustering with $L_{\mathrm{rw}}$ are well justified by the perturbation theory approach. Normalized spectral
clustering with $L_{\mathrm{sym}}$ can also be justified by perturbation theory, but it should be treated with more care
if the graph contains vertices with very low degrees.



8 Practical details 

In this section we will briefly discuss some of the issues which come up when actually implementing
spectral clustering. There are several choices to be made and parameters to be set. However, the
discussion in this section is mainly meant to raise awareness about the general problems which can
occur. For thorough studies on the behavior of spectral clustering for various real world tasks we refer
to the literature.

8.1 Constructing the similarity graph 

Constructing the similarity graph for spectral clustering is not a trivial task, and little is known on 
theoretical implications of the various constructions. 

The similarity function itself

Before we can even think about constructing a similarity graph, we need to define a similarity function
on the data. As we are going to construct a neighborhood graph later on, we need to make sure that the
local neighborhoods induced by this similarity function are "meaningful". This means that we need to
be sure that points which are considered to be "very similar" by the similarity function are also closely
related in the application the data comes from. For example, when constructing a similarity function
between text documents it makes sense to check whether documents with a high similarity score indeed
belong to the same text category. The global "long-range" behavior of the similarity function is not so
important for spectral clustering: it does not really matter whether two data points have similarity
score 0.01 or 0.001, say, as we will not connect those two points in the similarity graph anyway. In the
common case where the data points live in the Euclidean space $\mathbb{R}^d$, a reasonable default candidate is
the Gaussian similarity function $s(x_i, x_j) = \exp(-\|x_i - x_j\|^2/(2\sigma^2))$ (but of course we need to choose
the parameter $\sigma$ here, see below). Ultimately, the choice of the similarity function depends on the
domain the data comes from, and no general advice can be given.
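
As a small illustration of the Gaussian similarity function mentioned above, the sketch below evaluates $s(x_i, x_j) = \exp(-\|x_i - x_j\|^2/(2\sigma^2))$ for a nearby and a far-away pair of points. The points and the value of `sigma` are hypothetical.

```python
import math


def gaussian_similarity(x, y, sigma):
    """Gaussian similarity exp(-||x - y||^2 / (2 sigma^2)) of two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


sigma = 0.5                       # hypothetical bandwidth
x, near, far = (0.0, 0.0), (0.1, 0.0), (3.0, 0.0)

s_self = gaussian_similarity(x, x, sigma)
s_near = gaussian_similarity(x, near, sigma)
s_far = gaussian_similarity(x, far, sigma)

print(s_self, s_near, s_far)
```

The nearby pair has a similarity close to 1, while the far-away pair has a positive but negligible similarity. Which of the tiny long-range values occurs is irrelevant once such pairs are left unconnected in the similarity graph.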

Which type of similarity graph

The next choice one has to make concerns the type of the graph one wants to use, such as the $k$-nearest
neighbor or the $\varepsilon$-neighborhood graph. Let us illustrate the behavior of the different graphs using the
toy example presented in Figure 3. As underlying distribution we choose a distribution on $\mathbb{R}^2$ with
three clusters: two "moons" and a Gaussian. The density of the bottom moon is chosen to be larger
than the one of the top moon. The upper left panel in Figure 3 shows a sample drawn from this
distribution. The next three panels show the different similarity graphs on this sample.

Figure 3: Different similarity graphs, see text for details.

In the $\varepsilon$-neighborhood graph, we can see that it is difficult to choose a useful parameter $\varepsilon$. With
$\varepsilon = 0.3$ as in the figure, the points on the middle moon are already very tightly connected, while the
points in the Gaussian are barely connected. This problem always occurs if we have data "on different
scales", that is the distances between data points are different in different regions of the space.

The $k$-nearest neighbor graph, on the other hand, can connect points "on different scales". We can
see that points in the low-density Gaussian are connected with points in the high-density moon. This
is a general property of $k$-nearest neighbor graphs which can be very useful. We can also see that the
$k$-nearest neighbor graph can break into several disconnected components if there are high density
regions which are reasonably far away from each other. This is the case for the two moons in this example.

The mutual $k$-nearest neighbor graph has the property that it tends to connect points within regions
of constant density, but does not connect regions of different densities with each other. So the mutual
$k$-nearest neighbor graph can be considered as being "in between" the $\varepsilon$-neighborhood graph and the
$k$-nearest neighbor graph. It is able to act on different scales, but does not mix those scales with each
other. Hence, the mutual $k$-nearest neighbor graph seems particularly well-suited if we want to detect
clusters of different densities.

The fully connected graph is very often used in connection with the Gaussian similarity function
$s(x_i, x_j) = \exp(-\|x_i - x_j\|^2/(2\sigma^2))$. Here the parameter $\sigma$ plays a similar role as the parameter $\varepsilon$ in
the $\varepsilon$-neighborhood graph. Points in local neighborhoods are connected with relatively high weights,
while edges between far away points have positive, but negligible weights. However, the resulting
similarity matrix is not a sparse matrix.

As a general recommendation we suggest to work with the $k$-nearest neighbor graph as the first choice.
It is simple to work with, results in a sparse adjacency matrix $W$, and in our experience is less
vulnerable to unsuitable choices of parameters than the other graphs.
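
The qualitative differences between the graph types can be reproduced on a tiny data set with two scales. The sketch below is a minimal illustration in plain Python: the one-dimensional points (a dense group and a sparse group), the parameter values, and all function names are hypothetical, and the graphs are unweighted edge sets rather than weighted similarity graphs.

```python
def distance(x, y):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def nearest_neighbors(points, k):
    """For each point, the set of indices of its k nearest other points."""
    result = []
    for i, x in enumerate(points):
        ranked = sorted((distance(x, y), j) for j, y in enumerate(points) if j != i)
        result.append({j for _, j in ranked[:k]})
    return result


def epsilon_graph(points, eps):
    """Connect all pairs of points whose distance is smaller than eps."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if distance(points[i], points[j]) < eps}


def knn_graph(points, k, mutual=False):
    """k-nearest neighbor graph ("or" rule) or mutual k-nearest neighbor graph ("and" rule)."""
    nn = nearest_neighbors(points, k)
    n = len(points)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            i_to_j, j_to_i = j in nn[i], i in nn[j]
            if (i_to_j and j_to_i) if mutual else (i_to_j or j_to_i):
                edges.add((i, j))
    return edges


# Dense group (indices 0..3) and sparse group (indices 4..6).
points = [(0.0,), (1.0,), (2.0,), (3.0,), (20.0,), (30.0,), (40.0,)]

eps_edges = epsilon_graph(points, eps=3.5)
knn_edges = knn_graph(points, k=2)
mutual_edges = knn_graph(points, k=2, mutual=True)

print(sorted(eps_edges))
print(sorted(knn_edges))
print(sorted(mutual_edges))
```

With these values the $\varepsilon$-graph connects the dense group tightly but leaves the sparse points isolated, the $k$-nearest neighbor graph links the sparse group to the dense one through the edge (3, 4), and the mutual $k$-nearest neighbor graph connects each group internally without joining the two scales.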

The parameters of the similarity graph

Once one has decided for the type of the similarity graph, one has to choose its connectivity parameter
$k$ or $\varepsilon$, respectively. Unfortunately, barely any theoretical results are known to guide us in this task. In
general, if the similarity graph contains more connected components than the number of clusters we ask
the algorithm to detect, then spectral clustering will trivially return connected components as clusters.
Unless one is perfectly sure that those connected components are the correct clusters, one should make
sure that the similarity graph is connected, or only consists of "few" connected components and very
few or no isolated vertices. There are many theoretical results on how connectivity of random graphs
can be achieved, but all those results only hold in the limit for the sample size $n \to \infty$. For example,
it is known that for $n$ data points drawn i.i.d. from some underlying density with a connected support
in $\mathbb{R}^d$, the $k$-nearest neighbor graph and the mutual $k$-nearest neighbor graph will be connected if we
choose $k$ on the order of $\log(n)$ (e.g., Brito, Chavez, Quiroz, and Yukich, 1997). Similar arguments
show that the parameter $\varepsilon$ in the $\varepsilon$-neighborhood graph has to be chosen on the order of $(\log(n)/n)^{1/d}$ to guarantee
connectivity in the limit (Penrose, 1999). While being of theoretical interest, all those results do not
really help us for choosing $k$ on a finite sample.

Now let us give some rules of thumb. When working with the k-nearest neighbor graph, the
connectivity parameter should be chosen such that the resulting graph is connected, or at least has
significantly fewer connected components than clusters we want to detect. For small or medium-sized
graphs this can be tried out by hand. For very large graphs, a first approximation could be to choose
k on the order of log(n), as suggested by the asymptotic connectivity results.
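
The rule of thumb for the k-nearest neighbor graph can be tried out with a few lines of code. The following is a minimal sketch, assuming NumPy and SciPy are available, that the data points are distinct, and that unit edge weights suffice; the function names `knn_graph` and `smallest_connecting_k` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree


def knn_graph(X, k):
    """Symmetric k-nearest neighbor graph: connect i and j if either one is
    among the k nearest neighbors of the other. Edges get weight 1 here
    (an assumption; a similarity value could be used instead)."""
    n = X.shape[0]
    # Query k + 1 neighbors because each point is its own nearest neighbor.
    _, idx = cKDTree(X).query(X, k=k + 1)
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    A = csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))
    return A.maximum(A.T)  # symmetrize the directed neighbor relation


def smallest_connecting_k(X, k_start=1):
    """Increase k until the k-nearest neighbor graph has one component."""
    for k in range(k_start, X.shape[0]):
        n_comp, _ = connected_components(knn_graph(X, k), directed=False)
        if n_comp == 1:
            return k
    return X.shape[0] - 1
```

In practice one would probably not insist on a single component, but rather inspect how the number of components returned by `connected_components` changes with k and compare it with the number of clusters one wants to detect.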

For the mutual k-nearest neighbor graph, we have to admit that we are a bit lost for rules of thumb.
The advantage of the mutual k-nearest neighbor graph compared to the standard k-nearest neighbor
graph is that it tends not to connect areas of different density. While this can be good if there are clear
clusters induced by separate high-density areas, this can hurt in less obvious situations as disconnected
parts in the graph will always be chosen to be clusters by spectral clustering. Very generally, one can
observe that the mutual k-nearest neighbor graph has far fewer edges than the standard k-nearest
neighbor graph for the same parameter k. This suggests choosing k significantly larger for the mutual
k-nearest neighbor graph than one would do for the standard k-nearest neighbor graph. However, to
take advantage of the property that the mutual k-nearest neighbor graph does not connect regions
of different density, it would be necessary to allow for several "meaningful" disconnected parts of the
graph. Unfortunately, we do not know of any general heuristic to choose the parameter k such that
this can be achieved.

For the $\varepsilon$-neighborhood graph, we suggest choosing $\varepsilon$ such that the resulting graph is safely connected.
Determining the smallest value of $\varepsilon$ for which the graph is connected is very simple: one has to choose
$\varepsilon$ as the length of the longest edge in a minimal spanning tree of the fully connected graph on the
data points. The latter can be determined easily by any minimal spanning tree algorithm. However,
note that when the data contains outliers this heuristic will choose $\varepsilon$ so large that even the outliers
are connected to the rest of the data. A similar effect happens when the data contains several tight
clusters which are very far apart from each other. In both cases, $\varepsilon$ will be chosen too large to reflect
the scale of the most important part of the data.
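
The minimal spanning tree heuristic is short enough to write down directly. The sketch below assumes SciPy, Euclidean distances, distinct data points, and a data set small enough that the dense distance matrix fits in memory; `mst_epsilon` is a hypothetical name chosen for this example.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_epsilon(X):
    """Length of the longest edge in a minimal spanning tree of the complete
    Euclidean graph on the rows of X."""
    # Dense n x n distance matrix. SciPy treats zero entries as missing
    # edges, so this sketch assumes that all data points are distinct.
    D = squareform(pdist(X))
    return minimum_spanning_tree(D).data.max()
```

Adding a single far-away point to `X` makes the returned value at least as large as the distance from that point to its nearest neighbor, which is the outlier effect described above.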

Finally, if one uses a fully connected graph together with a similarity function which can be scaled
itself, for example the Gaussian similarity function, then the scale of the similarity function should be
chosen such that the resulting graph has similar properties as a corresponding k-nearest neighbor or
$\varepsilon$-neighborhood graph would have. One needs to make sure that for most data points the set of
neighbors with a similarity significantly larger than 0 is "not too small and not too large". In particular,
for the Gaussian similarity function several rules of thumb are frequently used. For example, one can
choose $\sigma$ on the order of the mean distance of a point to its k-th nearest neighbor, where k is chosen
similarly as above (e.g., $k \sim \log(n) + 1$). Another way is to determine $\varepsilon$ by the minimal spanning
tree heuristic described above, and then choose $\sigma = \varepsilon$. But note that all those rules of thumb are very
ad hoc, and depending on the given data at hand and its distribution of inter-point distances they
might not work at all.
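
The first rule of thumb for $\sigma$ can be sketched as follows. This is only an illustration under the assumptions that the points are distinct and that k is taken as the integer part of log(n) plus 1; the names `sigma_from_knn` and `gaussian_similarity` are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def sigma_from_knn(X, k=None):
    """Mean distance of each point to its k-th nearest neighbor."""
    n = X.shape[0]
    if k is None:
        k = int(np.log(n)) + 1  # k ~ log(n) + 1, as in the rule of thumb
    # Column 0 is the point itself (distance 0), column k its k-th neighbor.
    dist, _ = cKDTree(X).query(X, k=k + 1)
    return dist[:, k].mean()


def gaussian_similarity(X, sigma):
    """Dense matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))
```

A reasonable sanity check is to count, for each row of the resulting matrix, how many entries are noticeably larger than 0, and to compare that count with the neighborhood sizes one would accept in a k-nearest neighbor graph.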

In general, experience shows that spectral clustering can be quite sensitive to changes in the similarity
graph and to the choice of its parameters. Unfortunately, to our knowledge there has been no
systematic study which investigates the effects of the similarity graph and its parameters on clustering
and comes up with well-justified rules of thumb. None of the recommendations above is based on
firm theoretical ground. Finding rules which have a theoretical justification should be considered an
interesting and important topic for future research.



8.2 Computing the eigenvectors 

To implement spectral clustering in practice one has to compute the first k eigenvectors of a potentially
large graph Laplace matrix. Luckily, if we use the k-nearest neighbor graph or the $\varepsilon$-neighborhood
graph, then all those matrices are sparse. Efficient methods exist to compute the first eigenvectors
of sparse matrices, the most popular ones being the power method or Krylov subspace methods such
as the Lanczos method (Golub and Van Loan, 1996). The speed of convergence of those algorithms
depends on the size of the eigengap (also called spectral gap) $\gamma_k = |\lambda_k - \lambda_{k+1}|$. The larger this eigengap
is, the faster the algorithms computing the first k eigenvectors converge.
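
As an illustration of how a sparse Krylov solver might be called, here is a minimal sketch using SciPy's `eigsh` (an ARPACK wrapper for symmetric matrices) on the unnormalized Laplacian. The shift-invert strategy and the value of `shift` are assumptions made for this example, not something prescribed by the text; `first_eigenvectors` is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse import csc_matrix, diags
from scipy.sparse.linalg import eigsh


def first_eigenvectors(W, k, shift=1e-3):
    """First k eigenpairs of the unnormalized Laplacian L = D - W, where W
    is a sparse symmetric weight matrix."""
    d = np.asarray(W.sum(axis=1)).ravel()
    L = csc_matrix(diags(d) - W)
    # Shift-invert around a small negative value: L is positive
    # semi-definite, so L + shift * I is invertible and the eigenvalues
    # closest to -shift are the smallest ones.
    vals, vecs = eigsh(L, k=k, sigma=-shift, which="LM")
    order = np.argsort(vals)
    return vals[order], vecs[:, order]
```

On a path graph with n vertices the eigenvalues of L are known in closed form, $2 - 2\cos(\pi j / n)$ for $j = 0, \ldots, n-1$, which gives a convenient check of the output.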

Note that a general problem occurs if one of the eigenvalues under consideration has multiplicity larger
than one. For example, in the ideal situation of k disconnected clusters, the eigenvalue 0 has
multiplicity k. As we have seen, in this case the eigenspace is spanned by the k cluster indicator vectors.
But unfortunately, the vectors computed by the numerical eigensolvers do not necessarily converge to
those particular vectors. Instead they just converge to some orthonormal basis of the eigenspace, and
it usually depends on implementation details to which basis exactly the algorithm converges. But this
is not so bad after all. Note that all vectors in the space spanned by the cluster indicator vectors $\mathbb{1}_{A_i}$
have the form $u = \sum_{i=1}^k a_i \mathbb{1}_{A_i}$ for some coefficients $a_i$, that is, they are piecewise constant on the
clusters. So the vectors returned by the eigensolvers still encode the information about the clusters,
which can then be used by the k-means algorithm to reconstruct the clusters.
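
The piecewise-constant structure can be observed on a tiny example. The sketch below uses a dense solver on a hypothetical graph with two disconnected clusters; whichever orthonormal basis the solver returns for the eigenvalue 0, its columns are constant on each cluster.

```python
import numpy as np

# Two disconnected clusters: a triangle {0, 1, 2} and a single edge {3, 4}.
W = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
U = vecs[:, :2]                  # some orthonormal basis of the 0-eigenspace

# Each column is constant on {0, 1, 2} and on {3, 4}, even though the
# columns need not be the indicator vectors themselves.
spread_a = np.ptp(U[:3], axis=0).max()
spread_b = np.ptp(U[3:], axis=0).max()
```

Both `spread_a` and `spread_b` are zero up to rounding error, while rows of `U` belonging to different clusters differ clearly, which is all that the subsequent k-means step needs.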



8.3 The number of clusters 

Choosing the number k of clusters is a general problem for all clustering algorithms, and a variety of 
more or less successful methods have been devised for this problem. In model-based clustering settings 
there exist well-justified criteria to choose the number of clusters from the data. Those criteria are 
usually based on the log-likelihood of the data, which can then be treated in a frequentist or Bayesian 
way, for examples see Fraley and Raftery (2002). In settings where no or few assumptions on the 
underlying model are made, a large variety of different indices can be used to pick the number of 
clusters. Examples range from ad-hoc measures such as the ratio of within-cluster and between-cluster 
similarities, over information-theoretic criteria (Still and Bialek, 2004), the gap statistic (Tibshirani, 
Walther, and Hastie, 2001), to stability approaches (Ben-Hur, Elisseeff, and Guyon, 2002; Lange, Roth, 
Braun, and Buhmann, 2004; Ben-David, von Luxburg, and Pal, 2006). Of course all those methods can
also be used for spectral clustering. Additionally, one tool which is particularly designed for spectral
clustering is the eigengap heuristic, which can be used for all three graph Laplacians. Here the goal
is to choose the number k such that all eigenvalues $\lambda_1, \ldots, \lambda_k$ are very small, but $\lambda_{k+1}$ is relatively
large. There are several justifications for this procedure. The first one is based on perturbation theory,
where we observe that in the ideal case of k completely disconnected clusters, the eigenvalue 0 has
multiplicity k, and then there is a gap to the (k + 1)th eigenvalue $\lambda_{k+1} > 0$. Other explanations can
be given by spectral graph theory. Here, many geometric invariants of the graph can be expressed or
bounded with the help of the first eigenvalues of the graph Laplacian. In particular, the sizes of cuts
are closely related to the size of the first eigenvalues. For more details on this topic we refer to Bolla
(1991), Mohar (1997) and Chung (1997).

Figure 4: Three data sets, and the smallest 10 eigenvalues of $L_{\mathrm{rw}}$. See text for more details.

We would like to illustrate the eigengap heuristic on our toy example introduced in Section 4. For 
this purpose we consider similar data sets as in Section 4, but to vary the difficulty of clustering we 
consider the Gaussians with increasing variance. The first row of Figure 4 shows the histograms of 
the three samples. We construct the 10-nearest neighbor graph as described in Section 4, and plot the
eigenvalues of the normalized Laplacian $L_{\mathrm{rw}}$ on the different samples (the results for the unnormalized
Laplacian are similar). The first data set consists of four well separated clusters, and we can see that
the first 4 eigenvalues are approximately 0. Then there is a gap between the 4th and 5th eigenvalue,
that is $|\lambda_5 - \lambda_4|$ is relatively large. According to the eigengap heuristic, this gap indicates that the data
set contains 4 clusters. The same behavior can also be observed for the results of the fully connected 
graph (already plotted in Figure 1). So we can see that the heuristic works well if the clusters in 
the data are very well pronounced. However, the noisier or more overlapping the clusters are, the less
effective this heuristic is. We can see that for the second data set where the clusters are more "blurry",
there is still a gap between the 4th and 5th eigenvalue, but it is not as clear to detect as in the case
before. Finally, in the last data set, there is no well-defined gap; the differences between all eigenvalues
are approximately the same. But on the other hand, the clusters in this data set overlap so much that
many non-parametric algorithms will have difficulty detecting the clusters, unless they make strong
assumptions on the underlying model. In this particular example, even for a human looking at the
histogram it is not obvious what the correct number of clusters should be. This illustrates that, like
most methods for choosing the number of clusters, the eigengap heuristic usually works well if the data
contains very well pronounced clusters, but in ambiguous cases it also returns ambiguous results.
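
One simple way to turn the eigengap heuristic into code is to pick the position of the largest gap among the smallest eigenvalues. This is a sketch of one possible reading of the heuristic, not a definitive rule; it assumes a dense symmetric weight matrix with strictly positive degrees, and `eigengap_k` and `max_k` are hypothetical names.

```python
import numpy as np


def eigengap_k(W, max_k=10):
    """Pick k as the position of the largest gap among the smallest
    eigenvalues of the normalized Laplacian."""
    d = W.sum(axis=1)
    # L_sym = I - D^{-1/2} W D^{-1/2} has the same eigenvalues as L_rw and
    # is symmetric, so a symmetric eigensolver can be used.
    s = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - s[:, None] * W * s[None, :]
    lam = np.linalg.eigvalsh(L_sym)[: max_k + 1]
    gaps = np.diff(lam)              # gaps[i] = lam[i + 1] - lam[i]
    return int(np.argmax(gaps)) + 1, lam
```

Returning the eigenvalues together with k is deliberate: as the discussion above suggests, in ambiguous cases the largest gap is not much larger than the others, and it is safer to look at the whole sequence than to trust the argmax.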




Finally, note that the choice of the number of clusters and the choice of the connectivity parameters 
of the neighborhood graph affect each other. For example, if the connectivity parameter of the
neighborhood graph is so small that the graph breaks into, say, $k_0$ connected components, then choosing $k_0$
as the number of clusters is a valid choice. However, as soon as the neighborhood graph is connected, 
it is not clear how the number of clusters and the connectivity parameters of the neighborhood graph 
interact. Both the choice of the number of clusters and the choice of the connectivity parameters of 
the graph are difficult problems on their own, and to our knowledge nothing non-trivial is known on 
their interactions. 

8.4 The k-means step

The three spectral clustering algorithms we presented in Section 4 use k-means as the last step to extract
the final partition from the real valued matrix of eigenvectors. First of all, note that there is nothing
principled about using the k-means algorithm in this step. In fact, as we have seen from the various
explanations of spectral clustering, this step should be very simple if the data contains well-expressed
clusters. For example, in the ideal case of completely separated clusters we know that the eigenvectors
of L and $L_{\mathrm{rw}}$ are piecewise constant. In this case, all points $x_i$ which belong to the same cluster $C_s$
are mapped to exactly the same point $y_i$, namely to the unit vector $e_s \in \mathbb{R}^k$. In such a trivial case,
any clustering algorithm applied to the points $y_i \in \mathbb{R}^k$ will be able to extract the correct clusters.
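
To make the last step concrete, the following sketch runs plain Lloyd iterations on the rows of the eigenvector matrix in the ideal case of disconnected clusters. The farthest-point initialization is an assumption made only to keep the example deterministic, and `kmeans_rows` is a hypothetical helper rather than the procedure of any particular library.

```python
import numpy as np


def kmeans_rows(Y, k, n_iter=100):
    """Plain Lloyd iterations on the rows of Y, with a deterministic
    farthest-point initialization (an assumption of this sketch)."""
    centers = [Y[0]]
    for _ in range(1, k):
        # Next center: the row farthest from all centers chosen so far.
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([Y[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels


# Ideal case: three disconnected cliques of four vertices each.
W = np.kron(np.eye(3), np.ones((4, 4)) - np.eye(4))
L = np.diag(W.sum(axis=1)) - W
_, vecs = np.linalg.eigh(L)
Y = vecs[:, :3]                 # row i is the embedded point y_i
labels = kmeans_rows(Y, 3)
```

Because the rows of `Y` coincide within each clique and are well separated across cliques, any reasonable clustering of the rows recovers the three groups here; the choice of k-means matters only in less ideal situations.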

While it is somewhat arbitrary what clustering algorithm exactly one chooses in the final step of
spectral clustering, one can argue that at least the Euclidean distance between the points $y_i$ is a meaningful
quantity to look at. We have seen that the Euclidean distance between the points $y_i$ is related to the
"commute distance" on the graph, and in Nadler, Lafon, Coifman, and Kevrekidis (2006) the authors
show that the Euclidean distances between the $y_i$ are also related to a more general "diffusion
distance". Also, other uses of the spectral embeddings (e.g., Bolla (1991) or Belkin and Niyogi (2003))
show that the Euclidean distance in $\mathbb{R}^k$ is meaningful.

Instead of k-means, people also use other techniques to construct the final solution from the real-valued
representation. For example, in Lang (2006) the authors use hyperplanes for this purpose. A more
advanced post-processing of the eigenvectors is proposed in Bach and Jordan (2004). Here the authors
study the subspace spanned by the first k eigenvectors, and try to approximate this subspace as well
as possible using piecewise constant vectors. This also leads to minimizing certain Euclidean distances
in the space $\mathbb{R}^k$, which can be done by some weighted k-means algorithm.



8.5 Which graph Laplacian should be used? 

A fundamental question related to spectral clustering is the question which of the three graph
Laplacians should be used to compute the eigenvectors. Before deciding this question, one should always
look at the degree distribution of the similarity graph. If the graph is very regular and most vertices
have approximately the same degree, then all the Laplacians are very similar to each other, and will
work equally well for clustering. However, if the degrees in the graph are very broadly distributed,
then the Laplacians differ considerably. In our opinion, there are several arguments which advocate
for using normalized rather than unnormalized spectral clustering, and in the normalized case to use
the eigenvectors of $L_{\mathrm{rw}}$ rather than those of $L_{\mathrm{sym}}$.
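
For intuition about the regular case, the following worked special case may help. It is a sketch under the assumption that every vertex has exactly the same degree $c > 0$ (the symbol $c$ is introduced here only for illustration), using the definitions $L = D - W$, $L_{\mathrm{rw}} = D^{-1} L$ and $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$:

$$D = cI \quad\Longrightarrow\quad L_{\mathrm{rw}} = D^{-1} L = \tfrac{1}{c} L, \qquad L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = \tfrac{1}{c} L.$$

In this case all three matrices have the same eigenvectors, and their eigenvalues differ only by the factor $1/c$. When the degrees are only approximately equal, one would expect this agreement to hold approximately.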



Clustering objectives satisfied by the different algorithms 

The first argument in favor of normalized spectral clustering comes from the graph partitioning point 
of view. For simplicity let us discuss the case k = 2. In general, clustering has two different objectives: 
1. We want to find a partition such that points in different clusters are dissimilar to each other,
that is we want to minimize the between-cluster similarity. In the graph setting, this means to
minimize $\mathrm{cut}(A, \bar{A})$.

2. We want to find a partition such that points in the same cluster are similar to each other, that
is we want to maximize the within-cluster similarities $W(A, A)$ and $W(\bar{A}, \bar{A})$.

Both RatioCut and Ncut directly implement the first objective by explicitly incorporating $\mathrm{cut}(A, \bar{A})$
in the objective function. However, concerning the second point, both algorithms behave differently.
Note that

$$W(A, A) = W(A, V) - W(A, \bar{A}) = \mathrm{vol}(A) - \mathrm{cut}(A, \bar{A}).$$

Hence, the within-cluster similarity is maximized if $\mathrm{cut}(A, \bar{A})$ is small and if $\mathrm{vol}(A)$ is large. As this
is exactly what we achieve by minimizing Ncut, the Ncut criterion implements the second objective.
This can be seen even more explicitly by considering yet another graph cut objective function, namely
the MinMaxCut criterion introduced by Ding, He, Zha, Gu, and Simon (2001):

$$\mathrm{MinMaxCut}(A_1, \ldots, A_k) := \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{W(A_i, A_i)}.$$

Compared to Ncut, which has the terms $\mathrm{vol}(A) = \mathrm{cut}(A, \bar{A}) + W(A, A)$ in the denominator, the
MinMaxCut criterion only has $W(A, A)$ in the denominator. In practice, Ncut and MinMaxCut are
often minimized by similar cuts, as a good Ncut solution will have a small value of $\mathrm{cut}(A, \bar{A})$ anyway
and hence the denominators are not so different after all. Moreover, relaxing MinMaxCut leads to
exactly the same optimization problem as relaxing Ncut, namely to normalized spectral clustering with
the eigenvectors of $L_{\mathrm{rw}}$. So one can see in several ways that normalized spectral clustering incorporates
both clustering objectives mentioned above.

Now consider the case of RatioCut. Here the objective is to maximize $|A|$ and $|\bar{A}|$ instead of $\mathrm{vol}(A)$ and
$\mathrm{vol}(\bar{A})$. But $|A|$ and $|\bar{A}|$ are not necessarily related to the within-cluster similarity, as the within-cluster
similarity depends on the edges and not on the number of vertices in A. For instance, just think of 
a set A which has very many vertices, all of which only have very low weighted edges to each other. 
Minimizing RatioCut does not attempt to maximize the within-cluster similarity, and the same is then 
true for its relaxation by unnormalized spectral clustering. 

So this is our first important point to keep in mind: Normalized spectral clustering implements both 
clustering objectives mentioned above, while unnormalized spectral clustering only implements the 
first objective. 
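
The quantities discussed in this subsection are easy to compute for a given two-way partition, which makes it possible to compare the objectives on small examples. The sketch below assumes a dense symmetric weight matrix and the two-cluster case k = 2; `cut_objectives` and the dictionary keys are hypothetical names chosen for this illustration.

```python
import numpy as np


def cut_objectives(W, mask):
    """Two-way cut objectives for the partition (A, complement of A), where
    `mask` is a boolean array marking the vertices in A."""
    A, B = mask, ~mask
    cut = W[np.ix_(A, B)].sum()              # cut(A, A-bar) = W(A, A-bar)
    vol_A, vol_B = W[A].sum(), W[B].sum()    # vol = sum of the degrees
    within_A = W[np.ix_(A, A)].sum()         # W(A, A)
    within_B = W[np.ix_(B, B)].sum()         # W(A-bar, A-bar)
    return {
        "cut": cut,
        "RatioCut": cut / A.sum() + cut / B.sum(),
        "Ncut": cut / vol_A + cut / vol_B,
        "MinMaxCut": cut / within_A + cut / within_B,
        "within_A": within_A,
        "vol_A": vol_A,
    }
```

For two triangles joined by a single edge of weight 1, splitting at that edge gives cut = 1, vol(A) = 7 and W(A, A) = 6, in agreement with the identity W(A, A) = vol(A) - cut(A, A-bar) used above.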



Consistency issues 

A completely different argument for the superiority of normalized spectral clustering comes from a
statistical analysis of both algorithms. In a statistical setting one assumes that the data points $x_1, \ldots, x_n$
have been sampled i.i.d. according to some probability distribution P on some underlying data space
$\mathcal{X}$. The most fundamental question is then the question of consistency: if we draw more and more data
points, do the clustering results of spectral clustering converge to a useful partition of the underlying
space $\mathcal{X}$?

For both normalized spectral clustering algorithms, it can be proved that this is indeed the case (von
Luxburg, Bousquet, and Belkin, 2004, 2005; von Luxburg, Belkin, and Bousquet, to appear).
Mathematically, one proves that as we take the limit $n \to \infty$, the matrix $L_{\mathrm{sym}}$ converges in a strong sense
to an operator U on the space $C(\mathcal{X})$ of continuous functions on $\mathcal{X}$. This convergence implies that the
eigenvalues and eigenvectors of $L_{\mathrm{sym}}$ converge to those of U, which in turn can be transformed to a
statement about the convergence of normalized spectral clustering. One can show that the partition
which is induced on $\mathcal{X}$ by the eigenvectors of U can be interpreted similarly to the random walks
interpretation of spectral clustering. That is, if we consider a diffusion process on the data space $\mathcal{X}$, then
the partition induced by the eigenvectors of U is such that the diffusion does not transition between the
different clusters very often (von Luxburg et al., 2004). All consistency statements about normalized
spectral clustering hold, for both $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$, under very mild conditions which are usually satisfied
in real world applications. Unfortunately, explaining more details about those results goes beyond the
scope of this tutorial, so we refer the interested reader to von Luxburg et al. (to appear).

Figure 5: Consistency of unnormalized spectral clustering. Plotted are eigenvalues and eigenvectors
of L, for parameter $\sigma = 2$ (first row) and $\sigma = 5$ (second row). The dashed line indicates $\min_j d_j$; the
eigenvalues below $\min_j d_j$ are plotted as red diamonds, the eigenvalues above $\min_j d_j$ are plotted as blue
stars. See text for more details.

In contrast to the clear convergence statements for normalized spectral clustering, the situation for
unnormalized spectral clustering is much more unpleasant. It can be proved that unnormalized
spectral clustering can fail to converge, or that it can converge to trivial solutions which construct clusters
consisting of one single point of the data space (von Luxburg et al., 2005, to appear). Mathematically,
even though one can prove that the matrix $(1/n)L$ itself converges to some limit operator T on $C(\mathcal{X})$
as $n \to \infty$, the spectral properties of this limit operator T can be so nasty that they prevent the
convergence of spectral clustering. It is possible to construct examples which show that this is not only a
problem for very large sample size, but that it can lead to completely unreliable results even for small
sample size. At least it is possible to characterize the conditions when those problems do not occur: We
have to make sure that the eigenvalues of L corresponding to the eigenvectors used in unnormalized
spectral clustering are significantly smaller than the minimal degree in the graph. This means that if
we use the first k eigenvectors for clustering, then $\lambda_i \ll \min_{j=1,\ldots,n} d_j$ should hold for all $i = 1, \ldots, k$.
The mathematical reason for this condition is that eigenvectors corresponding to eigenvalues larger
than $\min_j d_j$ approximate Dirac functions, that is they are approximately 0 in all but one coordinate.
If those eigenvectors are used for clustering, then they separate the one vertex where the eigenvector
is non-zero from all other vertices, and we clearly do not want to construct such a partition. Again we
refer to the literature for precise statements and proofs.
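
The condition on the eigenvalues can be checked numerically before trusting unnormalized spectral clustering. The following is a minimal sketch for a dense weight matrix; the threshold `ratio`, used as a stand-in for "significantly smaller", is a hypothetical choice, as are the function names.

```python
import numpy as np


def unnormalized_eigs_vs_min_degree(W, k):
    """Return the first k eigenvalues of L = D - W together with the minimal
    degree, so that one can check whether lambda_i << min_j d_j holds."""
    d = W.sum(axis=1)
    lam = np.linalg.eigvalsh(np.diag(d) - W)[:k]
    return lam, d.min()


def safe_eigenvector_count(W, k, ratio=0.5):
    """Number of leading eigenvalues below ratio * min_j d_j. The threshold
    `ratio` is a hypothetical stand-in for "significantly smaller"."""
    lam, d_min = unnormalized_eigs_vs_min_degree(W, k)
    return int(np.sum(lam < ratio * d_min))
```

For two cliques of four vertices joined by one weak edge, only the first two eigenvalues lie well below the minimal degree, which matches the intuition that this graph contains two clusters.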

For an illustration of this phenomenon, consider again our toy data set from Section 4. We consider the
first eigenvalues and eigenvectors of the unnormalized graph Laplacian based on the fully connected
graph, for different choices of the parameter $\sigma$ of the Gaussian similarity function (see last row of
Figure 1 and all rows of Figure 5). The eigenvalues above $\min_j d_j$ are plotted as blue stars, the eigenvalues
below $\min_j d_j$ are plotted as red diamonds. The dashed line indicates $\min_j d_j$. In general, we can see
that the eigenvectors corresponding to eigenvalues which are much below the dashed lines are "useful"
eigenvectors. In case $\sigma = 1$ (plotted already in the last row of Figure 1), Eigenvalues 2, 3 and 4 are
significantly below $\min_j d_j$, and the corresponding Eigenvectors 2, 3, and 4 are meaningful (as already
discussed in Section 4). If we increase the parameter $\sigma$, we can observe that the eigenvalues tend to
move towards $\min_j d_j$. In case $\sigma = 2$, only the first three eigenvalues are below $\min_j d_j$ (first row in
Figure 5), and in case $\sigma = 5$ only the first two eigenvalues are below $\min_j d_j$ (second row in Figure 5).
We can see that as soon as an eigenvalue gets close to or above $\min_j d_j$, its corresponding eigenvector
approximates a Dirac function. Of course, those eigenvectors are unsuitable for constructing a
clustering. In the limit for $n \to \infty$, those eigenvectors would converge to perfect Dirac functions. Our
illustration of the finite sample case shows that this behavior not only occurs for large sample size,
but can be generated even on the small example in our toy data set.

It is very important to stress that those problems only concern the eigenvectors of the matrix L, and
they do not occur for $L_{\mathrm{rw}}$ or $L_{\mathrm{sym}}$. Thus, from a statistical point of view, it is preferable to avoid
unnormalized spectral clustering and to use the normalized algorithms instead.



Which normalized Laplacian?

Looking at the differences between the two normalized spectral clustering algorithms using $L_{\mathrm{rw}}$ and
$L_{\mathrm{sym}}$, all three explanations of spectral clustering are in favor of $L_{\mathrm{rw}}$. The reason is that the
eigenvectors of $L_{\mathrm{rw}}$ are cluster indicator vectors $\mathbb{1}_{A_i}$, while the eigenvectors of $L_{\mathrm{sym}}$ are additionally multiplied
with $D^{1/2}$, which might lead to undesired artifacts. As using $L_{\mathrm{sym}}$ also does not have any
computational advantages, we thus advocate for using $L_{\mathrm{rw}}$.
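
The relation between the eigenvectors of the two normalized Laplacians can be verified numerically. The sketch below uses a small connected graph with hypothetical toy weights and the definitions $L_{\mathrm{rw}} = D^{-1} L$ and $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$; it is meant only as an illustration of the scaling by $D^{1/2}$.

```python
import numpy as np

# Small connected graph with unequal degrees (hypothetical toy weights).
W = np.array([[0.0, 2.0, 1.0, 0.0],
              [2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.5, 0.0]])
d = W.sum(axis=1)
L = np.diag(d) - W
L_rw = L / d[:, None]                          # D^{-1} L
L_sym = L / np.sqrt(d[:, None] * d[None, :])   # D^{-1/2} L D^{-1/2}

lam, V = np.linalg.eigh(L_sym)   # columns of V: eigenvectors of L_sym
U = V / np.sqrt(d)[:, None]      # u = D^{-1/2} v

# Each column of U is an eigenvector of L_rw with the same eigenvalue.
residual = np.abs(L_rw @ U - U * lam).max()
# The first column of U is constant, the first column of V is not.
u0, v0 = U[:, 0], V[:, 0]
```

Here `u0` is constant across the vertices, whereas `v0` is proportional to the square roots of the degrees, so a low-degree vertex receives a noticeably smaller entry. This degree-dependent distortion is the kind of artifact referred to above.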



9 Outlook and further reading 

Spectral clustering goes back to Donath and Hoffman (1973), who first suggested constructing graph
partitions based on eigenvectors of the adjacency matrix. In the same year, Fiedler (1973) discovered
that bi-partitions of a graph are closely connected with the second eigenvector of the graph Laplacian,
and he suggested using this eigenvector to partition a graph. Since then, spectral clustering has
been discovered, re-discovered, and extended many times in different communities, see for example
Pothen, Simon, and Liou (1990), Simon (1991), Bolla (1991), Hagen and Kahng (1992), Hendrickson
and Leland (1995), Van Driessche and Roose (1995), Barnard, Pothen, and Simon (1995), Spielman
and Teng (1996), Guattery and Miller (1998). A nice overview of the history of spectral clustering
can be found in Spielman and Teng (1996).

In the machine learning community, spectral clustering has been made popular by the works of Shi and
Malik (2000), Ng et al. (2002), Meila and Shi (2001), and Ding (2004). Subsequently spectral
clustering has been extended to many non-standard settings, for example spectral clustering applied to the
co-clustering problem (Dhillon, 2001), spectral clustering with additional side information (Joachims,
2003), connections between spectral clustering and the weighted kernel k-means algorithm (Dhillon,
Guan, and Kulis, 2005), learning similarity functions based on spectral clustering (Bach and Jordan,
2004), or spectral clustering in a distributed environment (Kempe and McSherry, 2004). Also, new
theoretical insights about the relation of spectral clustering to other algorithms have been found. A
link between spectral clustering and the weighted kernel k-means algorithm is described in Dhillon et
al. (2005). Relations between spectral clustering and (kernel) principal component analysis rely on
the fact that the smallest eigenvectors of graph Laplacians can also be interpreted as the largest
eigenvectors of kernel matrices (Gram matrices). Two different flavors of this interpretation exist: while
Bengio et al. (2004) interpret the matrix $D^{-1/2} W D^{-1/2}$ as kernel matrix, other authors (Saerens,
Fouss, Yen, and Dupont, 2004) interpret the Moore-Penrose inverses of L or $L_{\mathrm{sym}}$ as kernel matrix.
Both interpretations can be used to construct (different) out-of-sample extensions for spectral
clustering. Concerning application cases of spectral clustering, in the last few years such a huge number
of papers has been published in various scientific areas that it is impossible to cite all of them. We
encourage the reader to query their favorite literature database with the phrase "spectral clustering"
to get an impression of the variety of applications.

The success of spectral clustering is mainly based on the fact that it does not make strong assumptions
on the form of the clusters. As opposed to k-means, where the resulting clusters form convex sets (or,
to be precise, lie in disjoint convex sets of the underlying space), spectral clustering can solve very
general problems like intertwined spirals. Moreover, spectral clustering can be implemented efficiently
even for large data sets, as long as we make sure that the similarity graph is sparse. Once the similarity
graph is chosen, we just have to solve a linear problem, and there are no issues of getting stuck in local
minima or restarting the algorithm several times with different initializations. However, we have
already mentioned that choosing a good similarity graph is not trivial, and spectral clustering can
be quite unstable under different choices of the parameters for the neighborhood graphs. So spectral
clustering cannot serve as a "black box algorithm" which automatically detects the correct clusters
in any given data set. But it can be considered as a powerful tool which can produce good results if
applied with care.

In the field of machine learning, graph Laplacians are not only used for clustering, but also emerge
for many other tasks such as semi-supervised learning (e.g., Chapelle, Schölkopf, and Zien, 2006 for
an overview) or manifold reconstruction (e.g., Belkin and Niyogi, 2003). In most applications, graph
Laplacians are used to encode the assumption that data points which are "close" (i.e., $w_{ij}$ is large)
should have a "similar" label (i.e., $f_i \approx f_j$). A function f satisfies this assumption if $w_{ij}(f_i - f_j)^2$
is small for all i, j, that is $f'Lf$ is small. With this intuition one can use the quadratic form $f'Lf$
as a regularizer in a transductive classification problem. One other way to interpret the use of graph
Laplacians is by the smoothness assumptions they encode. A function f which has a low value of $f'Lf$
has the property that it varies only "a little bit" in regions where the data points lie densely (i.e., the
graph is tightly connected), whereas it is allowed to vary more (e.g., to change the sign) in regions
of low data density. In this sense, a small value of $f'Lf$ encodes the so called "cluster assumption"
in semi-supervised learning, which requests that the decision boundary of a classifier should lie in a
region of low density.
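
The link between the quadratic form and the pairwise differences can be checked on random data. The sketch below assumes a symmetric non-negative weight matrix with zero diagonal and uses the identity $f'Lf = \frac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2$ for the unnormalized Laplacian; the random weights and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2.0          # symmetric, non-negative weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

f = rng.standard_normal(n)
quad_form = f @ L @ f
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
```

Both numbers agree up to rounding error, and the value is 0 for a constant f. A function that changes mostly across weak edges therefore receives a small penalty, which is how the regularizer expresses the cluster assumption.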

An intuition often used is that graph Laplacians formally look like a continuous Laplace operator (and
this is also where the name "graph Laplacian" comes from). To see this, transform a local similarity
$w_{ij}$ to a distance $d_{ij}$ by the relationship $w_{ij} = 1/d_{ij}^2$ and observe that

$$w_{ij}(f_i - f_j)^2 = \left(\frac{f_i - f_j}{d_{ij}}\right)^2$$

looks like a difference quotient. As a consequence, the equation $f'Lf = \frac{1}{2}\sum_{ij} w_{ij}(f_i - f_j)^2$ from
Proposition 1 looks like a discrete version of the quadratic form associated to the standard Laplace
operator $\mathcal{L}$ on $\mathbb{R}^n$, which satisfies

$$\langle g, \mathcal{L} g \rangle = \int |\nabla g|^2 \, dx.$$

This intuition has been made precise in the works of Belkin (2003), Lafon (2004), Hein, Audibert, and
von Luxburg (2005, 2007), Belkin and Niyogi (2005), Hein (2006), and Giné and Koltchinskii (2005).
In general, it is proved that graph Laplacians are discrete versions of
certain continuous Laplace operators, and that if the graph Laplacian is constructed on a similarity
graph of randomly sampled data points, then it converges to some continuous Laplace operator (or
Laplace-Beltrami operator) on the underlying space. Belkin (2003) studied the first important step of
the convergence proof, which deals with the convergence of a continuous operator related to discrete
graph Laplacians to the Laplace-Beltrami operator. His results were generalized from uniform
distributions to general distributions by Lafon (2004). Then in Belkin and Niyogi (2005), the authors prove
pointwise convergence results for the unnormalized graph Laplacian using the Gaussian similarity
function on manifolds with uniform distribution. At the same time, Hein et al. (2005) prove more general
results, taking into account all different graph Laplacians L, $L_{\mathrm{rw}}$, and $L_{\mathrm{sym}}$, more general similarity
functions, and manifolds with arbitrary distributions. In Giné and Koltchinskii (2005), distributional
and uniform convergence results are proved on manifolds with uniform distribution. Hein (2006)
studies the convergence of the smoothness functional induced by the graph Laplacians and shows uniform
convergence results.

Apart from applications of graph Laplacians to partitioning problems in the widest sense, graph
Laplacians can also be used for completely different purposes, for example for graph drawing (Koren,
2005). In fact, there are many more tight connections between the topology and properties of graphs
and the graph Laplacian matrices than we have mentioned in this tutorial. Now equipped with an
understanding of the most basic properties, the interested reader is invited to further explore and
enjoy the huge literature in this field on their own.